arXiv:l508.06104v2 [math.NA] 21 Sep 2015 


Fast randomized iteration: diffusion Monte Carlo through the lens 

of numerical linear algebra 

Lek-Heng Lim^ and Jonathan Weare^’^ 

^Department of Statistics, University of Chicago 
^James Franck Institute, University of Chicago 


Abstract 

We review the basic outline of the highly successful diffusion Monte Carlo technique com¬ 
monly used in contexts ranging from electronic structure calculations to rare event simulation 
and data assimilation, and show that aspects of the scheme can be extended to address a variety 
of common tasks in numerical linear algebra. From the point of view of numerical linear algebra, 
the main novelty of the new algorithms is that they work in either linear or constant cost per 
iteration (and in total, under appropriate conditions) and are rather versatile: In this article, 
we will show how they apply to solution of linear systems, eigenvalue problems, and matrix 
exponentiation, in dimensions far beyond the present limits of numerical linear algebra. In fact, 
the schemes that we propose are inspired by recent DMC based quantum Monte Carlo schemes 
that have been applied to matrices as large as 10^°® x 10^°®. We will also provide convergence 
results and discuss the dependence of these results on the dimension of the system. For many 
problems one can expect the total cost of the schemes to be sub-linear in the dimension of the 
problem. So while traditional iterative methods in numerical linear algebra were created in part 
to deal with instances where a matrix (of size 0{n^)) is too big to store, the adaptations of 
DMC that we propose are intended for instances in which even the solution vector itself (of size 
0{n)) may be too big to store or manipulate. 

1 Introduction 

Randomized algorithms have become immensely popular in numerical linear algebra over the last 
decade. Most of the developments to date have focused on low-rank approximations of large matri¬ 
ces (see e.g. [la lig [201 [211 |23l 1371 Eg IMl sa |56] ). For earlier uses of randomization in numerical 
linear algebra see, for example, [I] (in the context of matrix inversion) and |32] (for estimates 
of the trace of a matrix), and for an interesting description of the relationships between Markov 
Chain Monte Carlo schemes and common iterative techniques in numerical linear algebra see m- 
The goal of this article is to, after providing a brief introduction to the highly successful diffusion 
Monte Carlo (DMC) algorithm, suggest a new class of algorithms inspired by DMC for problems in 
numerical linear algebra. DMC is used in applications including electronic structure calculations, 
rare event simulation, and data assimilation, to efficiently approximate expectations of the type 
appearing in Feynman-Kac formulae, i.e., for weighted expectations of Markov processes typically 
associated with parabolic partial differential equations (see e.g. m)- While based on principles 
underlying DMC, the fast randomized iteration (FRI) schemes that we study in this article are 
designed to address arguably the most classical and common tasks in matrix computations: linear 
systems, eigenvector problems, and matrix exponentiation, i.e., solving for v in 

Av = b, Av = Xv, V = exp(A)6 (1) 
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for matrices A that might not have any natural association with a Markov process. 

FRI schemes rely on basic principles similar to those at the core of other methods that have 
appeared recently in the numerical linear algebra literature but they differ substantially in detail and 
in the problems they address. These differences will be remarked on again later, but roughly, while 
most of the randomized linear algebra techniques in the references above rely on a single sampling 
step, FRI methods randomize repeatedly and, as a consequence, require more carefully constructed 
randomizations. FRI schemes are not, however, the first to employ randomization within iterative 
schemes. In fact the strategy of replacing expensive integrals or sums (without immediate stochastic 
interpretation) appearing in iterative protocols has a long history. For example, it was used in 
schemes for the numerical solution of hyperbolic systems of partial differential equations in |12j . 
That strategy is represented today in applications ranging from density functional calculations in 
physics and chemistry (see e.g. m) to maximum likelihood estimation in statistics and machine 
learning (see e.g. M)- Though related in that they rely on repeated randomization within an 
iterative scheme, these schemes differ from the methods we consider in that they do not use a 
stochastic representation of the solution vector itself. In contrast to these and other randomized 
methods that have been used in linear algebra applications, our focus is on problems for which the 
solution vector is extremely large so they can only be treated by linear or constant cost algorithms. 
In fact, the scheme that is our primary focus is ideally suited to problems so large that the solution 
vector itself is too large to store so that no traditional iterative method (even for sparse matrices) 
is appropriate. This is possible because the scheme computes only low dimensional projections of 
the solution vector and not the solution vector itself. The full solution is replaced by a sequence 
of sparse random vectors whose expectations are close the true solution and whose variances are 
small. 

Diffusion Monte Carlo (see e.g. laiHlIIUEllEHlEolEslESlEHDis a central component in the 
quantum Monte Carlo (QMC) approach to computing the electronic ground state energy of the 
Schrodinger-Hamiltonian operator 0 

7iv = —^Av + Uv. (2) 

We are motivated in particular by the work in [8] in which the authors apply a version of the DMC 
procedure to a finite (but huge) dimensional projection of H onto a discrete basis respecting an 
anti-symmetry property of the desired eigenfunction. The approach in [8| and subsequent papers 
have yielded remarkable results in situations where the projected Hamiltonian is an extremely large 
matrix (e.g. x see m) and standard approaches to finite dimensional eigenproblems 

are far from reasonable (see 0 El 0113112 EH). 

The basic DMC approach is also at the core of schemes developed for a number of applications 
beyond electronic structure. In fact, early incarnations of DMC were used in the simulation of small 
probability events [UlllH] for statistical physics models. Also in statistical physics, the transfer 
matrix Monte Carlo (TMMC) method was developed to compute the partition functions of certain 
lattice models by exploiting the observation that the partition function can be represented as the 
dominant eigenvalue of the so-called transfer matrix, a real, positive, and sparse matrix (see mi)- 
TMMC may be regarded as an application of DMC to discrete eigenproblems. DMC has also 
become a popular method for many data assimilation problems and the notion of a “compression” 
operation introduced below is very closely related to the “resampling” strategies developed for those 

problems (see e.g. 1281 Ellis])- 

^The symbol A is used here and below to denote the usual Laplacian operator on functions of R'*, Au = d^.u. 

U is a potential function that acts on v by pointwise multiplication. Though this operator is symmetric, the FRI 
schemes we introduce below are not restricted to symmetric eigenproblems. 
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One can view the basic DMC (or Markov chain Monte Carlo for that matter) procedure as a 
combination of two steps: In one step an integral operator (a Green’s function) is applied to an 
approximate solution consisting of a weighted finite sum of delta functions, in another step the 
resulting function (which is no longer a mixture of delta functions) is again approximated by a 
finite mixture of delta functions. The more delta functions allowed in the mixture, the higher the 
accuracy and cost of the method. A key to understanding the success of these methods is the 
observation that not all delta functions (i.e., at all positions in space) need appear in the mixture. 
A similar observation holds for the methods we introduce: FRI schemes need not access all entries in 
the matrix of interest to yield an accurate solution. In fact, we prove that the cost to achieve a fixed 
level of accuracy with our methods can be bounded independently of the size of the matrix, though 
in many applications one should expect some dependence on dimension. As with other randomized 
schemes, when an effective deterministic method is available it will very likely outperform the 
methods we propose; our focus is on problems for which satisfactory deterministic alternatives are 
not available (e.g. when the size of the intermediate iterates or final result are so large as to prohibit 
any conceivable deterministic methods). Moreover, the schemes that we propose are a supplement 
and not a replacement for traditional dimensional reduction strategies (e.g. intelligent choice of 
basis). Indeed, successful application of DMC within practical QMC applications relies heavily on 
a change of variables based on approximations extracted by other methods (see the discussion of 
importance sampling in |24j I. 

The theoretical results that we provide are somewhat atypical of results commonly presented 
in the numerical linear algebra literature. In the context of linear algebra applications, both DMC 
and FRI schemes are most naturally viewed as randomizations of standard iterative procedures and 
their performance is largely determined by the structure of the particular deterministic iteration 
being randomized. For this reason, as well as to avoid obscuring the essential issues with the details 
of individual cases, we choose to frame our results in terms of the difference between the iterates vt 
produced by a general iterative scheme and the iterates generated by the corresponding randomized 
scheme Vt (rather than considering the difference between V) and limt^oo^^t)- In ideal situations 
our bounds are of the form 


\\\yt-vt\\\= sup 

/6C", ll/ll 


<1 




c 


m 


where the constant C is independent of the dimension n of the problem and m < n controls the 
cost per iteration of the randomized scheme (one iteration of the randomized scheme is roughly 
a factor of n/m less costly than its deterministic counterpart and the two schemes are identical 
when m = n). The norm in the above display measures the root mean squared deviation in low 
dimensional projections of the iterates. This choice is important as described in more detail in 
Sections an di For more general applications, one can expect the constant C, which incorporates 
the stability properties of the randomized iteration, to depend on dimension. In the worst case 
scenario, the randomized scheme is no more efficient than its deterministic counterpart (in other 
words, reasonable performance may require m ~ n). Our numerical simulations, in which n/m 
ranges roughly between 10^ and 10^^, strongly suggest that this scenario may be rare. 

We will begin our development in Section with a description of the basic diffusion Monte 
Carlo procedure. Next, in Section we describe how ideas borrowed from DMC can be applied to 
general iterative procedures in numerical linear algebra. As a prototypical example, we describe 
how randomization can be used to dramatically decrease the cost of finding the dominant eigenvalue 
and (projections of) the dominant eigenvector, first to 0(n) in the dimension of the problem and 
then to 0(1). Also in that section, we consider the specific case in which the iteration mapping is 
an e-perturbation of the identity, relevant to a wide range of applications involving evolutionary 
differential equations. In this case a poorly chosen randomization scheme can result in an unstable 
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algorithm while a well chosen randomization can result in an error that decreases with s (over £~^ 
iterations). Next, in Sections]^ andwe establish several simple bounds regarding the stability and 
error of our schemes. Finally, in Sectionj^we provide three computational examples to demonstrate 
the performance of our approach. A simple, educational implementation of Fast Randomized 
Iteration applied to our first computational example can be found here m- 

Remark 1. In several places we have included remarks that clarify or emphasize concepts that may 
otherwise be unclear to readers more familiar with classical, deterministic, numerical linear algebra 
methods. We anticipate that some of these remarks will be useful to this article’s broader audience 
as well. 


2 Diffusion Monte Carlo within quantum Monte Carlo 

The ground state energy, A*, of a quantum mechanical system governed by the Hamiltonian in © 
is the smallest eigenvalue (with corresponding eigenfunction u*) of the Hermitian operator T-L. The 
starting point for a DMC calculation is the imaginary-time Schrodinger equatioi:[^|^ 

dtv = —Hv (3) 

(for a review of QMC see |24)i. One can, in principle, use power iteration to find A*: beginning 
from an initial guess vq (and assuming a gap between A* and the rest of the spectrum of Ti), the 
iteration 

\t = —log e~^'^vt-iix) dx and vt = . - ~ . , (4) 

E J j E 

will converge to the pair (A*,u*) where u* is the eigenfunction corresponding to A*. 


Remark 2. For readers who are more familiar with the power method in numerical linear algebra, 
this may seem a bit odd but a discrete analogue of @ is just vt = Avt-i/\\Avt-i\\i applied to 
a positive definite matrix A = exp(—eFI) G ^.nxn jg Hermitiarj^ The slight departure 

from the usual power method is only in (i) normalizing by a 1-norm (or rather, by the sum of 
entries since both v and A are non-negative), and (ii) iterating on exp(—eFF) instead of 

on FF directly (the goal in this context is to find the smallest eigenvalue of FF not the magnitude- 
dominant eigenvalue of FF). The iteration on the eigenvalue is then Xt = — log||Aut_i||i since 
A*(FF) = —log A*(A). 


The first step in any (deterministic or stochastic) practical implementation of Q is discretiza¬ 
tion of the operator Diffusion Monte Carlo often uses the second order time discretization 




FF, = e-i^ei^e-i^. 


A standard deterministic approach would then proceed by discretizing the operator in space and 
replacing in Q with the space and time discretized approximate operator. The number of 
spatial discretization points required by such a scheme to achieve a fixed accuracy will, in general, 
grow exponentially in the dimension d of the argument of FF. 


^The reader familiar with quantum mechanics but unfamiliar with QMC may wonder why we begin with the 
imaginary time Schrodinger equation and not the usual Schrodinger equation idtv = —Hv. The reason is that while 
the solutions to both equations can be expanded in terms of the eigenfunction of FF, for the usual Schrodinger 
equation the contributions to the solution from eigenfunctions with larger eigenvalues do not decay relative to the 
ground state. By approximating the solution to the real time equation for large times we can approximate the ground 
state eigenfunction of FF. 

^ In practical QMC applications one solves for the p = v*v where v is an approximate solution found in advance 
by other methods. The new function p is the ground state eigenfunction of a Hamiltonian of the form Hv = 
— |Au + div(6u) + Uv. The implications for the discussion in this section are minor. 

"^Note that FF, which is a matrix is not the same as FF which is an operator on a function space and is only 
introduced for the purposes of relating the expression in 0 to the usual power method for matrices 
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Diffusion Monte Carlo uses two randomization steps to avoid this explosion in cost as d increases. 
These randomizations have the effect of ensuring that the random approximations of the iterates 
vt are always of the form 


v: 


“w = E 


Nt 

i=i 




where 6y{x) is the Dirac delta function centered at y £ the are real, non-negative numbers 
with = 1) and, for each j < Nt, As will be made clear in a moment, the 

integer m superscripted in our notation controls the number of delta functions, Nt, included in the 
above expression for V^. 

The fact that the function is non-zero at only Nt values is crucial to the efficiency of diffusion 
Monte Carlo. Starting with Nq = m and from an initial condition of the form 


vm 

'^0 




the first factor of e 2 ^ applied to results in 


1 


m 


E 


i=i 




X, 


(i) 


which can be assembled in 0{m) operations. The first of the randomization steps used in DMC 
relies on the well known relationship 


j f{x)[e"^^6y]{x)dx = 'Ey [/{Bs)] 


(5) 


where / is a test function, Bg is a standard Brownian motion evaluated at time s > 0, and the 
subscript on the expectation is meant to indicate that Bq = y (i.e. in the expectation in ([^ Bg is 
a Gaussian random variable with mean y and variance e). In fact, this representation is a special 
case of the Feynman-Kac formula f f(x)[e~^^(5y](x}dx = Ey [/(B£)e“-^o U{Bs)ds~^^ Representation 
Q suggests the approximation 


KeV,^ 


m 






U) 


(i) (i) (i) 

where, conditioned on the Xq \ the ' are independent and ' is normally distributed with mean 
Xq '^ and covariance el (here I is the d x d identity matrix). This first randomization has allowed 
an approximation of by a distribution, V^, that is again supported on only m points in W^. 

One might, therefore, attempt to define a sequence V)™ iteratively by the recursion 


vm 

D-i-i 


vm 

^t+1 


fv^^^(x)dx 

where we have recursively defined the weights 



wii\ = 




with = 1/m for each j. The cost of this randomization procedure is 0{dm) so that the total 
cost of a single iteration is 0{dm). 

At each step the weights in the expression for the iterates are multiplied by additional 
random factors. These factors are determined by the potential U and the positions of the Q . On 
the other hand, the evolve without reference to the potential function U (they are discretely 
sampled points from m independent Brownian motions). As a consequence, over many iterations 
one can expect extreme relative variations in the W/ ' and, therefore, poor accuracy in V)™ as an 
approximation of the functions vt produced by ©• 
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The second randomization employed by the DMC algorithm is the key to controlling the growth 
in variance and generalizations of the idea will be key to designing fast randomized iteration schemes 
in the next section. In order to control the variation in weights, at step t, DMC randomly re- 

(j) (j\ 

moves points Q ' corresponding to small weights and duplicates points corresponding to large 

weights. A new distribution is generated from so that 


E [F™ I 

(i) (i) 

by “resampling” a new collection of m points from the Q ' with associated probabilities . The 

(j) 

resulting points are labeled ' and the new distribution takes the form 


vm _ 1 

The next iterate is then built exactly as before but with replaced by All methods to 


select the generate, for each j, a non-negative integer with 


(j) 


I = mW^ 


(j) 


and then sets of the elements in the collection equal to so that Nt+i = 

■ For example, one popular strategy in DMC generates the independently with 


The above steps define a randomized iterative algorithm to generate approximations of 
vt- The second randomization procedure (generating Y^ from will typically require 0{m) 
operations, preserving the overall 0{dm) per iteration cost of DMC (as we have described it). The 
memory requirements of the scheme are also 0{dm). The eigenvalue A* can be approximated, for 
example, by 



for T large. 

Before moving on to more general problems notice that the scheme just outlined applies just as 
easily to space-time discretizations of e~^^. For example, if we set h = {1 + 2d)e and denote by 

C the uniform rectangular grid with resolution h in each direction, the operator e~'^^ can 
be discretized using 


^-eV. 




where, for any vector g ^ £^, 

^hg{x) = -{-g{x) + -^—Y^ ,, ^,9{y)] 

e \ I + 2d \\y-A\ 2 <h J 

(here we find it convenient to identify functions g —>■ M and vectors in The operator 

g-eAji again has a stochastic representation; now the representation is in terms of a jump Markov 
process with jumps from a point in £f^ to one of its nearest neighbors on the grid (for an interesting 
approach to discretizing stochastic differential equations taking advantage of a similar observation 
see [To]). 


Remark 3. The reader should notice that not only will we be unable to store the matrix 
(which is exponentially large in d) or afford to compute for a general vector v, but we will 

not even be able to store the iterates vt generated by the power method. Even the sparse matrix 
routines developed in numerical linear algebra to deal with large matrices are not reasonable for 
this problem. 
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In this discrete context, a direct application of the DMC approach (as in mi) would represent 
the solution vector as a superposition of standard basis element^ and replace calculation of K^^hV 
by a random approximation whose cost is (for this particular problem) free of any direct dependence 
on the size of (though its cost can depend on structural properties of which may be related 
to its size), whose expectation is exactly K^^hV, and whose variance is small. The approach in [8] 
is also an application of these same basic DMC steps to a discrete problem, though in that case 
the desired eigenvector has entries of a priori unkown sign, requiring that the solution vector be 
represented by a superposition of signed standard basis elements. 

In this article we take a different approach to adapting DMC to discrete problems. Instead of 
reproducing in the discrete setting exactly the steps comprising DMC, consider a slightly modified 
scheme that omits direct randomization of an approximation to and instead relies solely on 

a general random mapping very similar to the map from to but which takes a vector 
G and produces a new random vector Y)'” with m, or nearly m, non-zero components 
and satisfying E [Y^ \ V)™] as above. Starting from a non-negative initial vector G with 
||V)J"||i = 1 and with at most 0{md) non-zero entries, is generated from V)™ as follows: 


Step 1. Generate Y)™ = ‘h™ (T)™) with at most m non-zero entries. 


Step 2. Set = 


Ke,hYr 

\Ke,hYr\\ 


For example, adapting the popular resampling choice mentioned above, we might choose the entries 
of Y)"* independently with 

P[(rr), = [rn{vr)s\lm] = \m[vr)i\ - , 

Note that the iterates generated by this scheme have at most 0{dm) non-zero entries. The 

details of the mappings (which we call compression mappings) will be described later in Section]^ 
where, for example, we will find that the cost of applying the mapping will typically be 0{dm) 
when its argument has 0{dm) non-zero entries. And while the cost of applying to an arbitrary 
vector in is |T)(|, the cost of applying to a vector with m non-zero entries is only 0{md). 
The total cost of the scheme is therefore 0{md) in storage and operations per iteration. These 
cost requirements are dependent on the particular spatial discretization of if we had chosen 
a discretization corresponding to a dense matrix h then the cost reduction would be much less 
extreme. Nonetheless, as we will see in Sectionj^ the scheme just described can be easily generalized 
and, as we will see in Sections and will often result in methods that are significantly faster 
than their deterministic counterparts. 

Even restricting ourselves to the quantum Monte Carlo context, there is ample motivation to 
generalize the DMC scheme. Often one wishes to approximate not the smallest eigenvalue of % 
but instead the smallest eigenvalue corresponding to an antisymmetric (in exchange of particle 
positions) eigenfunction. DMC as described in this section, cannot be applied directly to com¬ 
puting this value, a difficulty commonly referred to as the Fermion sign problem. Several authors 
have attempted to address this issue with various modifications of DMC. In particular, Alavi and 
coworkers recently developed a version of DMC for a particular spatial disretization of the Hamilto¬ 
nian (in the configuration interaction basis) that exactly preserves antisymmetry (unlike the finite 
difference discretization we just described). Run to convergence, their method provides the same 
approximation as the so called full Cl method but can be applied with a much larger basis (e.g. in 

®The delta functions represent the indices of vt that we are keeping track of; is more commonly denoted ej in 
numerical linear algebra — the standard basis vector with 1 in the ^th coordinate and zero elsewhere. 
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experiments by Alavi and coworkers reported in m up to total functions in the expansion 

of the solution). Though it differs substantially in its details from the scheme proposed in [8], the 
generalizations of DMC represented by the two step procedure in the last paragraph and by the 
developments in the next section can also be applied to a wider range of discretizations of 7i. Finally 
we remark that, while we have considered DMC in the particular context of computing the ground 
state energy of a Hamiltonian, the method is used for a much wider variety of tasks with only minor 
modification to its basic structure. For example, particle filters (see e.g. m) are an application 
DMC to on-line data assimilation and substantive differences are mostly in the interpretation of 
the operator to which DMC is applied (and the fact that one is typically interested in the solution 
after finitely many iterations). 


3 A general framework 

Consider the general iterative procedure, 

vt+i=M{vt) (8) 

for vt G C"'. Eigenproblems, linear systems, and matrix exponentiation can each be accomplished 
by versions of this iteration. In each of those settings the cost of evaluating Ai(v) is dominated by 
a matrix-vector multiplication. We assume that the cost (in terms of floating point operations and 
storage) of performing the required matrix-vector multiplication makes it impossible to carry out 
recursion Q to the desired precision. As in the steps described at the end of the last section, we 
will consider the error resulting from replacement of Q by 

V,^,=Mmvn)- ( 9 ) 

where the compression maps : C"’ —>• are independent, inexpensive to evaluate, and enforce 
sparsity in the iterates (the number of non-zero entries in will be 0{m)) so that M can 
be evaluated at much less expense. When Ad is a perturbation of identity and an 0{n) scheme is 
appropriate (see Sections 3.2 and below) we will also consider the scheme, 

H-i = vr+M {^T{vn) - ^Tivn- (lo) 

The compressions will satisfy (or very nearly satisfy) the statistical consistency criterion 

E[chr(n)] = n (11) 

and will have to be carefully constructed to avoid instabilities and yield effective methods. For 
definiteness one can imagine that is defined by a natural extension of 0 







m\vj\ 


m\vj\ 

_ Iklli . 



m\vj\ 


m\vj\ 

ikiii 



mWi 


T 1 


m\Vi 


1 


m\v 


j\ 


Till 


m\V4 


Till 


( 12 ) 


to accept arguments v G C"" (F)™ is no longer a non-negative real number). This choice has several 
drawbacks, not least of which is its cost, and we do not use it in our numerical experiments. Alter¬ 
native compression schemes, including the one used in our numerical simulations, are considered in 
detail in SectionThere we will learn that one can expect that, for any pair f,vG C”', 

^E[|/HcD™(n)-/Hn|2] < ^^||/||^||^;||i (13) 


m 


(the superscript ^ is used throughout this article to denote the conjugate transpose of a vector with 
complex entries). These errors are introduced at each iteration and need to be removed to obtain 
an accurate estimate. Depending on the setting, we may rely on averaging over long trajectories, 
averaging over parallel simulations (replicas), or dynamical self-averaging (see Sections 3.2 and|^, 
to remove the noise introduced by our randomization procedure. Because the specific choice of Ai 
and the form of averaging used to remove noise can differ substantially by setting, we will describe 
the schemes within the context of specific (and common) iterative procedures. 


























3.1. The eigenproblem revisited Consider, for example, a more general eigenproblem than the 
one we considered in Section]^ Given K £ goal is to determine A* G C and n* G C” snch 

that 


Kvif = 


(14) 


and such that, for any other solution pair {^,u), \fi\ < |A*|. In what follows in this section we will 
assume that this problem has a unique solution. The standard methods of approximate solution of 


(14) are variants of the power method, a simple version of which performs 

u^Kvt 


Kvt 


vt+i = 


At+i — 


U^Vt 


(15) 


Under generic conditions, these converge to the correct (A*,n*) starting from an appropriate initial 


vector no (see e.g. m)- The scheme in (|15[) requires 0{nr) work per iteration and at least O (n) 


storage. In this article, we are interested in situations in which these cost and storage requirements 
are unreasonable. 


Prom O (n^) to O (n) For the iteration in ( |15[ ) the randomized scheme (© (along with an 
approximation of A*) becomes 


T/m _ 

U+1 — 




\m _ 
— 




(16) 


U*+1 = (1 - et)vT + AZi, = (1 - et)AT + i, 

where et is a sequence of positive numbers with = oo and < oo (if we choose 

St = then Vt and are trajectory averages). In ( |16[ ), the compressions are independent of 
one another. Using the rules defining in Section construction of <I>j”(Vj™') at each step will 
require 0{n) operations. Since multiplication of the vector <I>j"(Vj™') by a dense matrix K requires 
0{nm) operations, this scheme has an 0{n) storage and cost per iteration requirement. For exactly 


the same reasons, iterations based on (10) will also have cost and storage requirements of 0{n). As 
we will see in Section [4| w hen the mapping A4 is of the form v + eb{v) for some small parameter e, 
methods of the form (10) can be expected to have smaller error than iteration ([^. 

From 0{n) to 0(1) For many problems even 0(n) cost and storage requirements are unaccept¬ 
able. But now suppose that K is sparse with q non-zero entries per column. In this case not only 
would the cost of computing the matrix vector product Ar<h™(u) reduce to 0(qm), but, according 


to (16), the vectors would now have at most 0{qm) non zero entries and the cost of computing 
would reduce to O (qm) operations as well. Thus when K is sparse the total number of 


floating point operations required by (16) reduces to O (qm) per iteration. This observation does 


not hold for methods of the form (10) which will typically result in dense iterates and a cost 
of 0(n) even when K is sparse. 

As we have mentioned (and as was true in Section]^, in many settings even storing the full 
solution vector is impossible. Overcoming this impediment requires a departure from the usual 
perspective of numerical linear algebra. Instead of trying to approximate all entries of n*, our goal 
becomes to compute 

f* = rv, 

for some matrix / with n columns and many fewer (0(1) in n) rows. This change in perspective 


is reflected in the form of our compression rule error estimate in (13) and in the form of our 


convergence results in Section that measure error in terms of dot products with test vectors. 
Indeed, were we to estimate a more standard quantity such as 

E[||$r(^)-n||i] 
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we would find that the error decreased proportional to (n — m)/n requiring that m increase with 
n to achieve fixed accuracy. The consequence in terms of the algorithm is simply the removal in 


(16) of the equation defining Vi and insertion of 


T?m 


_ /HT/m 

— J 


and FT^, = {l-et)FT + etF^ 


t+i 


(17) 

which produces an estimate T™ of /*. 

Remark 4. While estimation of /* may seem an unusual goal in the context of classical iterative 
schemes it is completely in line with the goals of any Markov chain Monte Carlo scheme which 
typically seek only to compute averages with respect to the invariant measure of a Markov chain 
and not to completely characterize that measure. 

Schemes with 0(1} storage and operations requirements per-iteration can easily be designed for 
any general matrix. Accomplishing this for a dense matrix requires an additional randomization 
in which columns of K (or of some factor of K) are randomly set to zero independently at each 
iteration, e.g. again in the context of power iteration, assuming that has at most 0{m) non-zero 
entries, one can use 


Vm 

^t+1 




mi ,j 


(Kj) (V,- 


Till 


~mi,j 


(18) 


IS an 


in place of (16), where here Kj is used to denote the jth column of K and each 

TTl^ 

independent copy of ‘. The number of entries retained in each column is controlled by ml 
which, if the norms of the columns of K are known in advance (otherwise one can set ml = 1), can 
be set randomly at each iteration by applying, e.g. ([^ or some other resampling scheme with 
replaced by the vector with jth entry equal to HTCjHiKlj™) j. This alternative approach also leaves 


the next iterate, Ij+i, with at most 0(m) non-zero entries. Use of (18) in place of (16) will result 
in a scheme whose cost per iteration is independent of n if the compressions of the columns have 
cost independent of n. This may be possible without introducing significant error, for example, 
when the entries in the columns of K can take only a small number of values. However, use of 


(18) does not reduce the number of entries of K that must be accessed at any iteration and it will 


always increase error. In our numerical examples we do not apply the approach in (18) and we 


expect that in many applications effective constant cost per iteration schemes will only be possible 
if K is sparse. 

3.2. Peturbations of identity We now consider the case that A4 is a perturbation of the identity, 
that 

M{v) — V = eb{v) + 0 (e) (19) 


i.e. 


where e is a small positive parameter. This case is of particular importance because, when the goal 
is to solve a differential equation initial value problem 

= Ky)^ y(o) = yo, (20) 


dt^ 


discrete-in-time approximations take the form ([^ with Ai of the form in (19). As is the case 
in several of our numerical examples, the solution to (20) may represent, for example, a semi¬ 


discretization (a discretization in space) of a partial differential equation (PDE). 

Several common tasks in numerical linear algebra, not necessarily directly related to ODE or 
PDE can also be addressed by considering (20). For example, suppose that we solve the ordinary 


differential equation (ODE) ( |20[ ) with b(y) = Ay — r for some r G C”’ and any nxn complex valued 
matrix A. The solution to (|20|) in this case is 


y(t) = e^^yo + A-^I-e^^) r. 
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Setting r = 0 in the last display we find that any method to approximate ODE (20) for t = 1 can 


be used to approximate the product of a given vector and the exponential of the matrix A. On 
the other hand, if r 7 ^ 0 and all eigenvalues of A have negative real part then, for very large t, the 


solution to (20) converges to A ^r. In fact, in this case we obtain the continuous time variant of 


Jacobi iteration for the equation Ax = r. Like Jacobi iteration, it can be extended to somewhat 


more general matrices. Discretizing (20) in time with b{y) = Ay — r and a small time step allows 


treatment of matrices with a wider range of eigenvalues than would be possible with standard 
Jacobi iteration. 


Some important eigenproblems are also solved using an Ai satisfying (19). For example, this is 


the case when the goal is to compute the eigenvalue/vector pair corresponding to the eigenvalue of 
largest real part (rather than of largest magnitude) of a differential operator, e.g. the Schrodinger 
operator discussed in Section While the power method applied directly to a matrix A converges 
to the eigenvector of A corresponding to the eigenvalue of largest magnitude, the angle between the 
vector exp (tA) 2/0 and the eigenvector corresponding to the eigenvalue of A with largest real part 


converges to 0 (assuming yo is not orthogonal to that eigenvector). If we discretize (20) in time with 
b{v) = Av and renormalize the solution at each time step (to have unit norm) then the iteration 
will converge to the desired eigenvector (or a e dependent approximation of that eigenvector). 

As we will learn in the next section, designing effective fast randomized iteration schemes for 
these problems requires that the error in the stochastic representation AJ o of Ai decrease suf¬ 
ficiently rapidly with e. In particular, in order for our schemes to accurately approximate solutions 


to (20) over intervals of 0(1) units of time (i.e., over O (l/e) iterations of the discrete scheme), we 
will need, and will verify in certain cases, that 

[|/HM (<l>Y^(v)) - /HAJ(u)|2] ~ 

Obtaining a bound of this type will require that we use a carefully constructed random compression 
such as those described in Section]^ In fact, when a scheme with 0{n) cost per iteration is 
acceptable, iteration ([^ can be replaced by ( 10 ), i.e., by 

V,^, = Vr + ebi<l>TiVn) + o(e) ( 21 ) 

in which case we can expect errors over O (e~^) iterations that vanish with e (rather than merely 
remaining stable). As we will see in more detail in the next section, the principle of dynamic self¬ 
averaging is essential to the convergence of either ([^ or ( 10 ) when AJ is a perturbation of identity. 
The same principle is invoked in the contexts of, for example, multi-scale simulation (see e.g. [45| 
and |22j and the many references therein) and stochastic approximation (see e.g. |36] and the many 
references therein). 


4 Convergence 

Many randomized schemes in numerical linear algebra (see e.g. [13 m isni ED E3 E3 ESI sa SZl ES] ) 
rely at their core on approximation of a product such as AB, where, for example, A and B are 
n X n matrices, by a product of the form AQB where 0 is an n x n random matrix with E [0] = I 
and so that AQB can be assembled at much less expense than AB. For example, one might choose 
0 to be a diagonal matrix with only m <^n non-zero entries on the diagonal so that QB has only 
m non-zero rows and AQB can be assembled in 0(n‘^m) operations instead of 0(n^) operations. 
Alternatively one might choose 0 = where ^ is a nxm random matrix with independent entries, 
each having mean 0 and variance 1/m. With this choice one can again construct AQB in 0{'n?m) 
operations. Typically, this randomization is carried out once in the course of the algorithm. The 
error made in such an approximation can be expected to be of size 0(l/^/m) where the prefactor 
depends (very roughly) on the size of the matrices (and other structural properties) but does not 
depend directly on n. 
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In the schemes that we consider, we apply a similar randomization to speed matrix vector 
multiplication at each iteration of the algorithm (though our compression rules depend on the 
vector appearing in the product). As explored below, the consequence is that any stability property 
of the original, deterministic iteration responsible for its convergence, will be weakened by the 
randomization and that effect may introduce an additional n dependence in the cost of the algorithm 
to achieve a fixed accuracy. The compression rule must therefore be carefully constructed to 
minimize error. This is discussed in detail in Section In this section, we consider the error 
resulting from and for an unspecified compression rule satisfying the generic error properties 
established (with caveats) in Section]^ Both because it provides a dramatic illustration of the need 
to construct accurate compression rules and because of its importance in practical applications, we 
pay particular attention to the case in which Ad is an e-perturbation of the identity. Our results 
rely on classical techniques in the numerical analysis of deterministic and stochastic dynamical 
systems and, in particular, are typical of basic results concerning the convergence of stochastic 
approximation (see e.g. |36j for a general introduction and m for results in the context of machine 
learning) and interacting particle methods (see e.g. [17] for a general introduction and |l9| for 
results in the context of QMC). They concern the mean squared size of the difference between the 
output, V)”*, of the randomized scheme and the output, vt, of its deterministic counterpart and 
are chosen to efficiently highlight important issues such as the role of stability properties of the 
deterministic iteration Q, the dependence of the error on the size of the solution vector, n, and 
the role of dynamic self-averaging. More sophisticated results (such as Central Limit Theorems 
and asymptotic and non-asymptotic exponential bounds on deviation probabilities) are possible 
following developments in, for example, |36| and [I7lll9|. 

Our notion of error will be important. It will not be possible to prove, for example that 

remains small without a strong dependence on n. It is not even the case that 

E[||cl>r(r;)-u||i] 

is small when n is large and ||u||i = 1. Take, for example, the case that Vi = 1/n. In this case any 
scheme that sets n — m entries to zero will result in an error ||<I>J”(u) — u||i > (n — m)jn. On the 
other hand, we need to choose a measure of error sufficiently stringent so that our eventual error 
bounds imply that our methods accurately approximate observables of the form f^vt- For example, 
analogues of all of the results below using the error metric 



could be established. However, error bounds of this form are not, by themselves, enough to imply 
bounds on the error in f^vt because they ignore correlations between the components of V^. 

Remark 5. It is perhaps more typical in numerical linear algebra to state error bounds in terms 
of the quantity one ultimately hopes to approximate and not in terms of the distance to another 
approximation of that quantity. For example, one might wonder why our results aren’t stated in 
terms of the total work required to achieve (say with high probability) an error of a specified size 
in an approximation of the dominate eigenvalue of a matrix. Our choice to consider the difference 
between and vt is motivated by the fact that the essential characteristics contributing to errors 
due to randomization are most naturally described in terms of the map defining the deterministic 
iteration. More traditional characterizations of the accuracy of the schemes can be inferred from 
the bounds provided below and error bounds for the corresponding deterministic iterative schemes. 

Motivated by our stated goal, as described in Section of estimating quantities of the form 
f^vt we measure the size of the (random) errors produced by our scheme using the norm 

|||A|||= sup y/EWW] (22) 

ll/l|oc<l 
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where X is a random variable with values in C” (all random variables referenced are assumed to be 
functions on a single probability space which will be left unspecified). This norm is the (oo, 2)-norm 
of the square root of the second moment matrix of X, i.e. 


I^lll = Halloo, 2 = sup WRfh 

ll.flloc<l 


where 


R^R = E [XX“] . 

It is not difficult to see that the particular square root chosen does not effect the value of the norm. 
It will become apparent that our choice of the norm in (22) is a natural one for our convergence 
results in this and the next section. 

The following alternate characterization of IH-IH will be useful later. 


Lemma 1. The norm in (22) may also be expressed as 


sup 
l|G||cc,*<l 


E 




where 


E 'iL 

. max \Gij\ 

*=1 j=l,...,n 

is the dual norm of the oo-norm of G € 

IIGIloo = max ||G/||oo = max \Gij\. 

Proof. Indeed, 


(23) 

(24) 


E [||GX||?] = E 
where gi is the ith column of G and gi = gi/\\g: 


[(: 


\\gi\\oo \gfx 


< 1 


i=l 

nwoo- Using the condition 

E '/t 

max G 
i=l j=l,...,n 

and Jensen’s inequality we find that 

E [lio.-’fll?] < ELIL 

The other inequality follows by noting that for any / G C”" with 


(25) 


.B[\gfX\^]<\\\X\\ 


row equal to and all other rows zero satisfies the constraint (25). That (24) is the dual norm of 
the oo-norm follows from straightforward verification or see 


oo < 1, the matrix G with first 
24) is the dual norm 

□ 


Proposition 6.3]. 


Note that if the variable X is not random then one can choose fi = Xi/\Xi\ in (22) and find that 
|||X||| = ||X||i. When X is random we have the upper bound |||X|||^ < E [||X||^]. If, on the other 
hand, X is random but has mean zero and independent components then |||X|||^ = E [jlXjll] . 
Concerning the relationship between these two norms more generally, we rely on the following 
lemma. 

Lemma 2. Let A he any n x n Hermissian matrix with entries in C. Then 

sup f^Af > traced. 

ll/lloo<l 
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Proof. The result holds in n = 1 dimensions. Suppose that the result holds in n — 1 dimensions. 
We will show that it must also therefore hold in n dimensions and conclude, by induction, that the 
result holds in any dimension. 

Let A be the (n — 1) x (n — 1) principle sub-matrix of an n x n matrix A. For any vector / G C” 
we can write 

TT - •\n O fX-X-1 - 

f^Af = y^, \fi\^A,, + 2^ 


= /M/+|/n|'^nn + 23f? 


fv —^Tl —^2 — 1 — 

^j=i 
n—1 

^j=i 


[/■ 


— V- 


where / G has entries equal to the first n — 1 entries of /. 

By the induction hypothesis, we can choose the first n — 1 entries of / (i.e., /) so that the right 
hand side of the last display is not less than 

1 


e::>. 


+ |/n| Ann + 251? 


fn 




Anjf' 


If, for this choice of /, Anjfj is non-zero then choose /„ as 

fn = -^ 

A ■ f. 

^j=l ^njJj 

Otherwise set fn = I- With the resulting choice of fn, 


\fn\‘^Ann + 23 ? 


^r^—1 


0=1 


Anj /' 


We have therefore shown that 


/»E 

sup f^Afy'^, An. 

^—^ 2=1 


> Ar. 


□ 


3<1 


Lemmaj^ applied to the second moment matrix of X, implies that |||W|||^ > E [HWH^]. Summarizing 
these relationships we have 

E[||W||i] <|||W|f <E[||W||2]. (26) 

Consistent with results in the next section we will assume that the typical error from our 
compression rule is 

pr(ii)^ iii^iii (27) 


m 


for V G 


where 7 is a constant that is independent of m and n. We will also assume that 

EOll-rWII?] <C||t.||? ( 28 ) 

for some constant C independent of m and n. For all of the compression methods discussed in the 
next section, the bias ||E[$™(n)] — n||i is negligible compared to the statistical error in ( |27[ ). To 
slightly simplify expressions we will omit contributions to our bounds from compression bias (we 
do consider the bias of the overall iterative procedure). In other words we will assume that is 
satisfied exactly. 

As a result of the appearance of ||n||i in (27), in our eventual error bounds it will be impossible 
to avoid dependence on ||V)™'||i. The growth of these quantities is controllable by increasing m, 
but the value of m required will often depend on the n. The next theorem concerns the size of 
E [IIV)™Ilf] . After the statement and proof of the theorem we discuss how the various quantities 
appearing there can be be expected to depend on n. In this theorem and in the rest of our results 
it will be convenient to recognize that, in many applications, the iterates V)™ and = <I>™(V)™’) 
are confined within some subset of C"'. For example, the iterates may all have a fixed norm or may 
have all non-negative entries. We use the symbol X to identify this subset (which differs depending 
on the problem). 
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Theorem 6. Suppose that for some constants 7 and R, 

U{M{v)) < aU{v) + R 
for all V £ X, and that, for some constant j3, 

ll^^ll? < ^U{v) 

for all V £ X. Assume further that there is a constant C and some matrix G £ satisfying 

(25), so that, for all z £ C^, 

sup {D^U{v)) z < C\\Gz\\l 
where D^IA is the matrix of second derivatives oflA. Then 


E[||V',”II?1 <)9fl 




+/3a'(i + ^) mry 


Proof. Let Yf^ = and notice that 

U{vn 

<R + aUiyfff^) + a {fA{Yfff) - Uiyf^f^)) . 

Using the fact that lA is twice differentiable with bounded second derivative this last expression is 
bounded above by 

Uiyr) <R + aUiyl'l,) + aVM(y”,) (F”, - F”,) + ^||G (F”, - F”,) H?. 

An application of Lemma 0 along with the fact that E [Yff^ \ reveals the bound 

E [l|0{F”, -F”,)||?] < |||F,“, - F,“,r. 

As a consequence, noting ( [27| ), we arrive at the upper bound 

E p{vr)\ <R + ai/(F”i) + ^E [||F”, ||?] 

<-R + “(i + ^)e[m(f”,)] 

from which we can conclude that 

First, the reader should notice that setting 7 = 0 in Theorem shows that the deterministic 
iteration ([^ is stable whenever a < 1. However, even for an Jv[ corresponding to a stable iteration, 
the randomized iteration ([^ may not be stable (and will, in general, be less stable). If the goal 
is to estimate, e.g. a fixed point of Xi, the user will first have to choose m large enough that the 
randomized iteration is stable. 

Though it is not explicit in the statement of Theorem in general the requirements for stability 
will depend on n. Consider, for example, the case of a linear iteration, AA{v) = Kv. This iteration 
is stable if the largest eigenvalue (in magnitude) is less than 1. If we choose hl{v) = ||u ||2 then we 
can take a to be the largest eigenvalue of K^K and i? = 0 in the statement of Theorem The 
bound ll-lli < -v/nIMb (and the fact that it is sharp) suggests that we will have to take /3 = n in 
Theorem]^ (note that we can take C = 1 in the eventual bound). This scaling suggests that, to 
guarantee stability we need to choose m ~ n. 

Fortunately this prediction is often (but not always) pessimistic. For example, if AT is a matrix 
with non-negative real entries and Vq™ has non-negative real entries then the iterates will have 
real, non-negative entries (i.e., v £ X implies Vi > 0). We can therefore use h{{v) = (I'^u)^ = ||u|b 


E[iiurii?] <f3B[u{vn]<m 


Y jdcK' ( 1 + 


2m 


UiV, 


on 


□ 
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for V ^ X and find that we can take a = ||-Ri||i, R = 0, and /3 = 1, in the statement of Theorem]^ 
With this choice of U we can again choose C = 1 so that n does not appear directly in the stability 
bound. We anticipate that most applications will fall somewhere between these two extremes; 
maintaining stability will require increasing m as n is increased but not in proportion to the 
increase in re. 

Having characterized the stability our schemes we now move on to bounding their error. We have 
crafted the theorem below to address both situations in which one is interested in the error after 
a finite number of iterations and situations that require error bounds independent of the number 
of iterations. In general, achieving error bounds independent of the number of iterations requires 
that Ai satisfy stronger stability properties than those implied by the conditions in Theorem 
While the requirements in Theorem could be modified to imply the appropriate properties for 
most applications, we opt instead for a more direct approach and modify our stability assumptions 


on Wf to (29) and (30) below. In this theorem and below we will make use the notation Ail to 


denote Ai composed with itself t — s times. In our proof of the bound in Theorem [7] below we will 
divide the error into two terms, one of which is a sum of individual terms with vanishing conditional 
expectations. Those individual summands are Martingale differences, and the size of their sum can 
be expected to grow less than linearly with the number of iterations. This general observation is 
called dynamic self-averaging and results in an improved error bound. The improvement is essential 
in the context of perturbations of the identity and we will mention it again below when we focus 
on that case. 

Theorem 7. Suppose that the iterates of ([^ remain in X C C"". Fix a positive integer 
T. Assume that, for some a > 0 and some constants Li and L 2 and for every pair of integers 
s < r <T, for every vector f G with ||/||oo < 1 there is a matrix G G satisfying (25) and 

irMiiv) - rAii{v)\ 


sup 


< Lia'^ 


(29) 


and for every / G C” with 
there is a matrix A (a measurable function of v) so that 

If^Ailjv) - f^Ailjv) - f^Ajv - v)\ 
\\Gv-Gv\\j 

for all V G X. Then the error at step t <T satisfies the bound 

1 


\\Gv-Gv\\i 

00 < 1 and some matrix G G satisfying (25), for all v G X 


< L 2 W 


(30) 


\Vr-vt\\\<a 


'y—;={Li + L 2 ) 


m 


--^ ] Mt + 7 — Li2- - Mrp 

1 — J m 1 — a 


where = sup^-^-rE [||V(,™'||f] . 

Proof. We begin with a standard expansion of the scheme’s error. 

iiiv^r-^dii = |||^r-HM||| 

Now notice that if we define = (k^ClT") then = Air{YfA) and the last equation becomes 

\\\vr - 1.,111 = \\\Y,‘fMUY.)-M‘Avr) 

III ^r=0 

The right hand side of the last equation is bounded above by 

|||^‘;'m*(w) -E[w[*(w) IKH +^*;'iiiE[M*(w) I vn-Miivn 
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Considering the first term in the last display note that, for any fixed / G C”, 

I K™l)r] =E‘=oE[|/''(M*(K)-E[MtW I vn)\"] 


+ I ICD) X {f^{Mt{Ys)-E[Ml{Ys) I Er]))]}. 

Letting J> denote the d-algebra generated by {lL™}g^Q and {Y^Y^~q, for s < r we can write 

E[(YiMUYr) -EIMUy,) I vr])) X - E[M|(L,) | K™]))] 

= E[E[/«(Mt(Lr) - E[At* (y.) I Vr]) I J^r] X (/H(M|(y,) - E[At|(y,) | Er]))] • 

Because, conditioned on V^, Y^ is independent of J>, the expression above vanishes exactly. 
Supremizing over the choice of /, we have shown that 

iiirr - oilii < - nMMYr) I ^”1111")I vr\ - M’Avrn 

Expanding the term inside of the square root we find that 


\vr-vt\\\ < 


< 


. (Ill-M*Er) - Ml{vn\\\ + ll|E[Af*(y.) I VY] - K(e; 


1/2 


+ Y,' \\mMi{Yr)\VY]-Ml{V; 


^r =0 


Y,' \\\\Mi{Yr)-MYVr 




1/2 


+ 


Y' \\mMi{Yr)\Vr]-MYVr 


n 


1/2 


ii—/r =0 

where, in the second inequality, we have used the triangle inequality for the 1“^ norm in ML If 
Ar G is any matrix value random variable, measurable with respect to the u-algebra generated 
by VY, then 

E[M*(y,) I VY] - Mi{VY) = E[(M‘ - Ar){Yr) \ V^] - (Ml - Ar){VY). 

, we obtain the upper bound 
1/2 ^t-1 , 

+L 2 Y pr(K™)-E 


As a consequence, applying our assumptions (29) and 

, 1-1 


^r=0 


a‘^it-r)\\\^rn^yrn^_yrn\ 


^r =0 


|||Er-iii|||<(Li + L2) [Y 

Bounding the error from the random compressions we arrive at the error bound 

, 1-1 


rm I 

' r I 


\vr -I < 


tAi±M (y'"‘ a=i‘-')E [iirni?])''" + — 

Ym \Y-^r=o 1 ^ 1 / m 


Z—/j .=0 


a 


t—r 


E 


\vr\\f\ 


□ 


Condition (29) is easily verified for linear maps Ai = K where the magnitude of the dominant 
eigenvalue of K is less than 1. The condition is more difficult to verify for the power iteration map 


A condition close to (29) holds for all v,v with ||u||i = ||i;||i = 1 but the parameter a in general 
depends on the proximity of v and v to the space spanned by all of the non-dominant eigenvectors 
(see e.g. j54|l. For large enough m we can ensure that all iterates V-Y remain at least some fixed 
distance from the space spanned by the non-dominant eigenvectors but this will often require that 
m grow with re. Moreover, almost sure bounds separating the iterates from the non-dominant 
eigenspace are more than is required to prove convergence of the randomized schemes. 

Again we can identify cases in which condition (29) holds for the power iteration map without 
direct dependence on re in the bound. When the matrix K is real and non-negative we can assume 
that the iterates are also real and non-negative, i.e., the iterates remain in 

A = {u G C"" : Uj > 0, ||u||i = = 1} . 
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On the other hand, if K is also irreducible, then the non-dominant eigenspace of K consists of 
vectors with both positive and negative entries and our iterates, by virtue of being non-negative, 
are bounded away from this space. For more general problems one can expect the speedup over 
the standard deterministic power method to be roughly between a factor of n and no speedup at 
all (it is clear that the randomized scheme can be worse than its deterministic counterpart when 
that method is a reasonable alternative). A complete description of the speedup in the particular 
case that Ad is the power iteration map would be very interesting but is not pursued here. 

Bias. Even for a very general iteration the effect of randomization is evident when one considers 
the size of the expected error (rather than the expected size of the error). When Ad(u) = Kv and 
V)™' is generated by (§, one can easily check that zt = E \V^] satishes the iteration zt+i = Kzt, 
i.e., zt = vt- Even when the mapping Af is non-linear, the expected error, E [V)™'] — vt, is often much 
smaller than the error, V)™ — vt, itself. The following is just one simple result in this direction and 
demonstrates that one can often expect the bias to be (which should be contrasted to an 

expected error of 0(m“^/^)). The proof is very similar to the proof of Theorem]^ and is omitted. 


Theorem 8. 

bound 


Under the same assumptions as in Theorem\^ the bias at step t < T satisfies the 


\E[vn-vt\\i< 


7 ^L2 q:(1 — a^) 


m 


1 — a 


sup E [UK 

r<T 


mn 2 i 


Perturbations of identity. When the goal is to solve ordinary or partial differential equations, 
stronger assumptions on the structure of Ai are appropriate. We now consider the case in which 
Ad is a perturbation of the identity. More precisely we will assume that 

M{v) = V + eb{v). (31) 

Though we will not write it explicitly, further dependence of 6 on e is allowed as long as the 
assumptions on b below hold uniformly in e. In the differential equations setting e can be thought 
of as a time discretization parameter as in Section 

When Ad is a perturbation of identity, it is reasonable to strengthen our assumptions on the 
error made at each compression step. The improvement stems from the fact that the mapping Ad 
nearly preserves the sparsity of its argument. As we will explain in detail in the next section, if 
V G Sm where 

5m = {z G C" : #{j : Zj 0} < m} 

and w G C^, then it is reasonable to assume that, for example, 

C - - 

|||<l>)"(u + w) — V — rclll < llu -|- rcllf (32) 

\/m 

for some constant C independent of m and n. The following Lemma illustrates how such a bound 
on the compression rule can translate into small compression errors when Ad is a perturbation of 
the identity. 


Lemma 3. Suppose that (32) and (28) hold. If M.{v) = v + eb{v) and, for some constant L, 
||K'^)lli + ||u||i) for all V G X, then, for V)™' generated by (§, and for some constant 7 , 

mrivn - vnf < f^v'E[iK-i| 2 ]y/i+E[iiK™iii?] (33) 


Proof. If TK = ^r(^r) then 

E [\r<i>T{vn - rvr\^ \ = e {Yt-i+ebir,^,)) - r (k™ + P1 

<c^biYffi,)Mvr\\i 

m 
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for some constant C. Our assumed bound on the growth of b along with (28) implies that 

E[||6(yA)||?] <c-' (l + E[||yAII?]) 

for some constant C. From these bounds it follows that for some constant 7 , 

mTivn - vnf < [||y-||f]y/i + E[||y,™ Ilf], 

We now provide versions of Theorems and [^appropriate when W1 is a perturbation of identity. 
The proofs of both of these theorems are very similar to the proofs of Theorems and and are, 
at least in part, omitted. First we address stability in the perturbation of identity case. 


□ 


Theorem 9. Suppose that the iterates 


of @ remain in X d C” and that ( |33| ) holds. Suppose 
further that U satisfies the conditions in the statement of Theorem with the exception of the 
following: Now 

U{M{v)) < e^^U{v) + eR. 


Then 


sup B[\\vr\\i] <m 

t<T je 


exp 


£ + 




- 1 


“ ^ 2m 


+ /3exp 


T ( a + 


2m 


U{V^). 


where the various constants appearing in this expression were defined in and before the statement 
of Theorem\^ 

What is important about the statement of Theorem [^ is that the bound remains stable as e 
decreases despite the fact that set being supremized over is increasing. Under the assumptions in 
the theorem (which are only reasonable when W 1 is a perturbation of the identity) one can expect 
that the iterates can be bounded over O (£~^) iterations uniformly in e. 

The following theorem interprets the result of Theorem when W1 is a perturbation of identity. 
One might expect that, over iterations, O {\/£) errors made during the compression step 

would accumulate and lead to an error of 0(£“^/^). Indeed, this is exactly what would happen if 
the errors made in the compression step were systematic (i.e., if the compression bias was O (\/£)). 
Fortunately, when the compression rule satisfies the consistency criterion 0 the errors self average 
and their effect on the overall error of the scheme is reduced. As mentioned above, this phenomenon 
played a role in the structure of the result in Theorem and its proof, but its role is more crucial 
in Theorem 10 which provides uniform in e bounds on the error of over 0{£ iterations. 
Without the reduction in the growth of the error with t provided by self-averaging it would not be 
possible to achieve an error bound over O (£~^) iterations that is stable as e decreases. 

Theorem 10. Suppose that the iterates of remain in X C. C"" and that ( |33| ) holds. Fix a 
real number T > 0. Assume that 

M{v) = V + £b{v). 

Assume further that, for some real number fi and some constants Li and L 2 and for every pair of 
integers s < r < T/e, for every vector f £ with ||/||oo < 1 there is a matrix G £ satisfying 


(25) and 


sup 


irMijv) - rM:{v)\ 
\\Gv-Gv\\i 


< Lie 


—£/3(r—s) 


(34) 


and for every / G C” with ||/||oo < 1 and some matrix G £ satisfying (25), for all v £ X 

there is a matrix A (a measurable function of v) so that 

\rM:{v)-rMl{v)-rA{v-v)\ 


WGv-Gvf, 


< L2e 


—elS(r—s) 


(35) 
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for all V £ X. Then the error at step t <Tfs satisfies the bound 

lllvr - filll < [llv?”!!?] + + ' 

+ ^ [IIV?”!!!] + + . 

where M|. = [||E/”||^] . 

Proof. By exactly the same arguments used in the proof of Theorem we arrive at the bound 

1 

- vt\\\ < {Li + La) ) 

+ La 

ii—^r =0 

Bounding the error from the random compressions we arrive at the error bound 


' “ "*^*111 < 


7(Li + L 2 ) f -2Sts 


^EllbCII?] + 




fL2 


E t — 1 

r =0 


,-S(t-r)£ 


y^E[||E-||?]-\/l + EOl^t-ill?]- ° 


Though the error established in the last claim is stable as £ decreases, we have mentioned in 
Section 3.2 that when is a perturbation of identity, by using iteration instead of (^, one 
might be able to obtain errors that vanish as e decreases (keeping m fixed). This is the subject 
of Theorem 11 below which, like Theorem 10 relies crucially on self-averaging of the compression 
errors. Note that iteration (10) typically requires O (n) operations per iteration and storage of 
length n vectors. We have the following theorem demonstrating the decrease in error with e in this 
setting. 


Theorem 11. Suppose that the iterates of (10) remain in X C and that (27) holds. Under 

or 

r(Ll + L2 )Li 


the same assumptions on A4 as in Theorem 10. the error at step t <T/£ satisfies the bound 


sup \\\Vr-vt\\\ < 

t<T/E 


m 


1 — e 2 


2/3 


sup 

r<Tl£ 




+ 


£'y^L 2 Li /1 — e 


m 




sup E\\\V, 

r<T je 




Proof. By a very similar argument as in the proof of Theorem arrive at the bound 

iiivT - oilii < +'='>(7”)) - Mu,{vr+ 5Kv;”))r) 


+ 


(EXqIii^ i^r+iivr + ebivn) i k™] - muv. 




1/2 


+ E'lllllE [MUi(vr + eb(Yr)) I vr] - MU,(Vr + eb(V, 


^r=0 


n 


which, also as in that proof, is bounded above by 


|Er-^;t||| < (L1 + L2) 




^r=0 


a 


Xt-r-l) 


MYn-biK 


Dif 


1/2 


From (34) and Lemma we find that 

\\\b{Yn-b{vn 

The rest of the argument proceeds exactly as proof of Theorem 


^—^r=0 


n 


<Li|||y™-E; 


□ 
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5 Compression rules 

In this section we give a detailed description of the compression rule used in our numerical simu¬ 
lations as well as several others and analysis of the accuracy of those schemes. Our compression 
schemes have the general form in Algorithm Programmed efficiently, and assuming that v has 


Data: V G C"" with all nonzero entries, m G N 
Result: V = G C”" with at most m nonzero entries 

£ = 0 ; 

V = 0; 


r = ||u||i/m; 

I = argmaxi{|ui|}; 
while \vi\ > r do 

i = e + i-, 

Vi = VI] 


VI = 0 ; 

r = ||u||i/(m — i)] 

I = argmaxi{|uj|}; 

end 

For each j let Nj be a non-negative random integer with E [A^] 


finally, for i = 1,2,... ,n set 


Vi = Ni 


Vi\{m - £) ’ 


{m - £)\vj\/\\v\\i] 


Algorithm 1: A simple compression rule. 


exactly n nonzero entries, all of the schemes we discuss in this section will require at most 0{n) 
floating point operations including the generation of as few as one uniform random variate and 
0{n + mlogn) floating point comparisons. It is likely that better compression schemes are possi¬ 
ble, for example by incorporation of ideas from m- The reader should note that in this section 
n represents the number of non-zero entries in the input vector v of the compression rule and not 
the dimension associated with a particular problem (which may be much larger). 

What is left unspecified by the steps in Algorithm is the choice of the joint distribution of the 
Nj. For many applications, the details of this choice are crucial. The simplest possibilities, such 
as choosing the Nj to be distributed according to a multinomial distribution with probabilities 
Pj = |r’j|/||'c||i, can easily result in unstable schemes. Before detailing better choices, we remark 
on the deterministic procedure preceding the generation of the Nj. Let cr be a perturbation of 
{1, 2 ,..., n} so that the elements of v^ have decreasing magnitude and let 

= minjo < £ < m : I I} • (^6) 

r™ is the number of entries of v that are exactly preserved (in V) during the while loop in 
Algorithmic Note that if the number of nonzero entries in v exceeds m then r™ < m. 


Lemma 4. r™ satisfies 


E 


I I ^ “ TV 
To-- < - 

j=T^ + l 


m 


T 1 - 


Proof. Observe that if r™ > 0, then condition 

E^ 


I I ^ II II 

\Vti \ S - T 1 

j=t+i' ^ m 
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holds for £ = 0. Assume that 


E 


j=e 


m — i + 1 

VctA < - b 1- 


m 


for some i < r™. From the definition of r™ and the fact that i < r™, we must also have that 


— V 

Combining the last two inequalities yields 

E" 




'j=e+i 


m — i 

VaA < - \\v\\i. 

■' m 


□ 


As we will find below, this Lemma implies that our compression rules will have errors that decrease 
as r™ increases, justifying the presence of this additional deterministic step. In cases where the 
\vaj \ decrease rapidly with j, the reduction in error can be dramatic. 

Returning now to a more detailed description of the joint distribution of the Nj we begin with 
a particularly simple choice (and one that will have to be modified later) suggested by common 
practice in diffusion Monte Carlo applications. For each j generate an independent uniform random 
variable in (0,1), and set 


Nj = 


[m — i)\vj 


+ 


(37) 


One can easily check that E [Aj] = (m —7)|ujj/||'(;||i. The next lemma characterizes the error result¬ 
ing from this choice. Because we have a particular interest in cases in which At is a perturbation 
of identity and because, in those cases 

\mvn - ^riii = wiMYt-i+^ kyaai)) - - eb{YAA,)\\\ 

where = ‘h™ ;^(V)™') has at most m non-zero entries, we will pay close attention to the effects on 
our error bounds of a perturbation w G C” of a vector v with only m nonzero entries, i.e., v G Sm- 
The lemma includes an estimate of E [||‘I’J"(u)||^] and of the probability of the event {<I>i(u) = 0} 
which will both play a role in the discussion immediately following its proof. 


Lemma 5. For <!>* defined by Algorithm^ with (37) and for any vectors v G Sm and w ^ C^' 




|||<hi(u + w) — V — U)||| < V 2 








v-\-w 


+iTi 


w. 




T, 


The righthand side of the (38) is itself bounded above by 

^/2 


1 1 
||u -|- 


v-\-w 


m 

Concerning the size of the resampled vector we have the bound 

E [||#r(!> + to)llf] = IlK + <"11? + 2- + 

Finally, if > 0 then P [<ht(u -|- re) = 0] = 0. If = 0 then 

llrclli 1 


(38) 


(39) 


m 


P [$t(u) = 0] < I m in 


u-|-tc 1 e 


Proof. First we assume that, for all j, \vj + Wj\ < ||u-|-t(;||/m. We will remove this assumption later. 
With this assumption in place, Nj G {0,1} and the while loop in Algorithm is inactive so that 


f''Mv+w)^Y.AN^^N. 


m 


'E,[\f^<^tiv + w)-v-w\‘^]='E 




\v -|- tc 


m 


Nj — V — w 
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The random variables in the snm are independent so the last expression becomes 


E 


[\f^^t{v + w)- f^{v + n;)p] = 


E 


\v + tc 


m 


Nj — V — w 


P + w\\i 


E n 

var [iV^-] . 


Since Nj G {0,1}, the expression for the variance of Nj becomes 


var[iV,-] = E[iV,](l-E[iV,-]) = 


m{vj + Wj) f ^ m{vj + Wj) 


so that 


E [\f^^t{v + w) -V - ^ ^ 


n + will 

2 


m — 


m 


\v + w||l 
2 


n + w 1 


\v + w ||2 


Becanse this scheme does not depend on the ordering of the entries of + w we can assume 
that the entries have been ordered so that Vj = 0 for j > m. In this case we can write 

1 /v—\ 2 1 


II ii2 I 

iT + w'lb = + 

which then implies that 


+ E 


j=m+l 


W 


•3\ ^ 


> — 


m 




Vj + Wj 


II 112 / 

E [\rMv + w)- ( 1 


1 

\v + w||f 


> — ( n + W 1 - w i) 
m 


|f + w||i - ||w||i) 


||w||i , ,, .. 2||w||i||n + will 

< ^^( 2 ||i; + w||i - ||w||i) < " 


ra 


m 


Expression (38) follows from the definition of and the fact that for j < are preserved 
exactly by <ht. The upper bound in ( |39[ ) follows after an application of Lemma 

In bounding the size of <l>™(i; + w) we will again assume that = 0 and that the entries 
have been ordered so that Vj = 0 for j > m. The size of the resampled vector can be bounded by 
first noting that, since the Nj are independent and are in {0,1}, 


E 






m\Vj + Wj\ ^ 

^3=^ ll^' + 'U'lli 2^i=\2^ 


E n 

.7 = 1 


m\vi + Wj| m\vj + Wj\ 


m\vj + Wjl 


n + w||i 

n m\Vj + Wj 




j=i+i ||i; + w||i lln + wlli 
n m\vi + Wi\ m\vj + Wj\ 


+E 


j=l kj + will 


= m 


+E 


n m\vj + Wj\ (m\vj + Wj 


i=*+i ||i; + w||i 

I I 1x2 

m\Vj + Wjl 

||n + w||i 
2 


T + ii' 1 


^i=i llf + wlli 
Breaking up the last sum in this expression we find that 

I , I / I , I \ 2 

\m m\Vj + Wj\ '' 


\v + W 1 


E lfL 

.7 = 1 


i=i n + w 1 


77l|nj + Wj\ 

||n + w||i 


77 1 T W 7 

n -n-^ 

i=l ||77 + W||l 

Vj+Wj\ 


E m 

.7=1 


< m 1 — 


1-V 

Z^_7 = l 


and that 


so that 


E 


m\Wn 


j=m+l ki + ID b 


771 Wl 


i=i Ik + w||i 
2 


< 


m \vj + Wjl 
i=l ||77 + w||l 
7TT,||w||l 
\v + w||i 


77 + W 1 


< 


777||W||l 
||77 + w||l 


E 


[(e;. 


N, 


2 i 


9 771 W 1 

<777^ + 2-: " " 


It follows then that (at least when = 0), 

E [||<I>™(77 + W)||f] < ||77 + w||l + 2 


77 + W||l 

||77 + w||l||w||l 


777 
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Writing the corresponding formula for > 0 and applying Lemma gives the bound in the 

statement of the lemma. 

Finally we consider the probability of the event + re) = 0} . If = 0 then Nj G {0,1} 

so that P [Nj = 0] = 1 — m\vj + tCj|/||u + rc||i, and, since the Nj are independent, 


P[^. 



m\vj + Wj 

||u + will 



m\vj + Wj\ 
||u + w||l 


The first product in the last display is easily seen to be bounded above by e . The second product 
is maximized subject to the constraint 




1 - 


m\Vn 


Wi 


< 


m||w||i 


^j<n,vj^oy ||i; + w||i / “ ||u + w||i 
when the terms in the product are all equal, in which case we get 


P [Nj = 0 for all j] < 


\v + W 1 


□ 


In practice one would need to modify the choice in (37) to avoid several potential drawbacks. 


First, With any application of the resulting compression rule (assuming that r™ = 0) there is a 
non-zero probability that = 0 (i.e., that all the Nj = 0). As Lemma demonstrates, the 

probability of this event is extremely small. The issue can be avoided by simply sampling 
until 7 ^ 0, i.e., sample conditioned on the event ^ 0} . The resulting bias 

and additional cost are negligible (as can be seen using the estimate of the probability of the event 
{<I>5|”(u) = 0} given in Lemmaj^. Another potential drawback is the possibility that $t”(u) has more 
than m non-zero entries. This can be easily avoided at insignificant additional cost by randomly 
setting entries of <h™(u) to zero (and renormalizing the result) using the systematic compression 
procedure described below. The resulting compression of v has the same norm as <h)”(u) and 
exactly the same expectation. The resulting statistical error is also insignificant (a fact that can 
be established using the estimate on the size of ^™(u) in Lemma [^. 

A much more significant drawback associated with the choice ( |37[ ) is the requirement that 
we generate n (assuming v has n non-zero components) independent uniform random variates. 
Depending on the cost of evaluating Ai(yi^), the generation of so many random variables could 
constitute a significant portion of the cost of the overall scheme. An unbiased method that is 


observed in practice to behave similarly to the scheme in (37) but which requires only a single 
random variate, exactly preserves the norm of the input vector, and produces an output vector 
with exactly m non-zero entries, can be defined by, for k = 1,2,... ,m — setting 

1 


Ik = max 


{> ^ 


and 


where 






1 


m — £ 


lA 

W 

A 

M 

II 

(40) 

if 3 G {4}, 
if 3 i {4}, 

(41) 

-ik-l + U). 

(42) 


and U G (0,1) is a single uniformly chosen random variable. The random sampling step in the 
resulting compression scheme corresponds to the so called systematic resampling scheme used fre¬ 
quently in the context of sequential Monte Carlo (see e.g. [Ej). It is well known in that context, 
that systematic sampling can fail to converge for certain sequences of input vectors (they need to 
change as one increases m or the method will converge). Nonetheless, systematic resampling is 
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an extremely popular technique and in many applications results in lower error than competing 


schemes. Similarly, for certain choices of n, the compression scheme specihed by (41) and (42) is 


guaranteed to converge only as mjn approaches 1. Supporting the use of the scheme for problems 
in which Af is a perturbation of the identity, it is possible to prove that if, after exact preservation 
of all entries . + Wa . with j < , the remaining entries are re-ordered so that entries with 

Vj = 0 appear at the end of the list of remaining entries in n + tc, then (assuming = 0 for 

simplicity) 


|<l>”^(u + w) —n — rclli < 2||u;||i 


when ||tc||i < ||u + w\\i/m. Though we have not used it in our numerical tests, the additional 
rearrangement step imposes no substantial additional cost and is essential to this bound. 


Slight modifications of (41) and (42) do result in provably convergent schemes. For example, an¬ 


other commonly used resampling scheme in sequential Monte Carlo called stratified resampling (see 
e.g. [16] ) corresponds to choosing each uniform random variable G [{k — 1)/(m — £), k/{m — i)) 


in (40) independently. Again the resulting scheme is unbiased and preserves the norm of the 


input vector. The following lemma provides a bound on the scheme’s error. 

Lemma 6 . Let a be a perturbation of {1,2,... ,n} so that the elements of v„ -|- have decreasing 


magnitude and let be defined as in (36) with v replaced by v + w. For the compression rule 
defined in Algorithm\l with chosen independently in (40), 


|‘h™(n + w) — V — rc||| < 


1 


E” 


frn F™ 


X min< 1, 


(m - TfXJ (Em,,, k-i I) 1 


E n 

i— 


■7=r" 
J y 


\Va. + WaA 


for any pair v,w £ with v G 5™. The right hand side in 


\v -|- tc 1 


min< 1 , 


m 


{m — r, 


v-\-w) 


F' 1 


is itself bounded above by 

11 / 2 ', 


n -|- re 1 


(43) 


(44) 


m 


Proof. Because the scheme exactly preserves entries for £ < and, for all £ > 

1 + Wa A < 2^ I Va, + Wa, \, 

Ilf' I v-\-W J~‘v + W^^ 

it suffices to consider the case in which, for all j, \vj + Wj\ < ||n -|- rc||i/m. Expression (43) will 


follow by setting to 0 entries with index aj for j < T{m) in both v and w and replacing m in the 
eventual bound by m — . 

Let V = <h"*(n -|- w) and suppose / G C"' with ||/||oo < 1- Our goal is to establish an upper 
bound for 

E[\rV-riv + w)\^] =^rar[rV] 

where var on the right hand side of the last display represents the variance of the random variable 
f^V and we have used the easily established fact that E [F] = v + w. Let sq = 0 and 

1 I I 

A'= Ti —^—ir / .■ A'>^i + wi\ 

||n -|- tclli 

for j = 1, 2,..., n, and define the function I{u) for u G [0,1] by 

I{u) = max{j : Sj_i < u < Sj} . 

Then we can write 

chta _ l|n + rclli J 


rv = 


m 




2 /(f/(fe))| 
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so, since the are independent, 


var [rV] = 


P + wWi 


E fll 

k=l 


Vi(uW) 


Now let 


Pk,j = m\[sj-i,Sj) n [{k - l)/m,k/m) \ . 
In terms of the pkj we can write 


l^7(f/(*))l 




Vi{uW) 


E TT. — r> » — 

l/jl Pk,j - X]i=i ■^*1 


For /c = 1, 2,..., m, let 


We hnd that 




Vi{u(.k)) 

b/(;7(fe))l 


jk = argmaxpfcj and ru = Pk,jk- 
]<n 


+ E”,J. 


+ l/iiVi - l/iJ^’'fc + 2|/iJrfc 


i¥=jk \Vi 


< rfc(l - rk) + (1 - Tk) + 2rfc(l - rk) < 4(1 - rk). 

It remains to bound the sum “ ^k) ia terms of ||r(;||i/||u + rc||i. Define ckq = 0 and for 

k = 1,2,... ,m. Let ak be the index of the kth. entry of u + tc with Vk / 0. Notice that, since 
\vj + Wj\ < ||u + rc||i/m, 

^ k . ||^L’||l k + l 


Sai. <-1" 


m u + rc 1 m 


Moreover, notice that 


and that if 


E tn 

7 = 1 


™ \Vai+WaA.^, 

-rr^ > 1 — 


.7=1 ||u + rt;||i 


|u + tclll 


^-^fc+1 \vctj I ^ A; -|- 1 ||ii^||i 

"^7=1 ||u + r(;||i “ m ||u + tc||i' 


"^7=1 ||u + r(;||i 


\Vak+i + 'l’afc+1 


\v + w\ 


-fc+i I 

1 ^7 = 1 


fc+1 + WaA 


||u + will 

k ||ii^||i 


^ 1 _^k + 1 llifilli k ||ii^||i 

~ m m ||u + w||i m ||i; + w||i' 


allowing us to conclude by induction that 

E k I Ha,* 4“ ■ I k II w 111 k — 1 

~ir^ - u ^-71-n“ >-• 

7=1 ||u + w||i m ||u + w||i m 

On the other hand, these bounds, and the fact that \vj + Wj\ < ||u + w||i/m, also imply that 


■Sot-i —-4 


A: — 1 ||iii||i k 


m ||i; + w||i m 


< — and s^fc-i > 


A — 1 ||ia||i k — 2 

m ||u + w||i m 
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From the bounds (45) and (46) we conclude that within any interval [{k — l)jm,kjm) we can 
find at most two points in the set {sajj and those two points are Sa^_-^ and Likewise, we can 
find at most the two points Sa,.-i and from the set in [{k — l)/m, k/m). 

If Sak is contained in the interval [{k — l)/m, k/m) then so is the complete interval (max{(fc — 
l)/m, Sqj,). If the entire interval is contained in [{k — l)/m,k/m) then > 

m\va^. + tCoj,|/||u + tclli- Since, for any i, 

\Vaj +Wa^\ + \Vaj + Waj\ > ||u + rc||l - HtcHi, 

and \vj + Wj\ < ||u + w\\i/m, we hnd that, for any j, 

\vaj + Wqj \ > ||u + w\\i/m - lirclli 

and therefore that 


'i--rk< 


u + w\\i 


If SQfc-l < {k — l)/m then > rnsa^, — (A: — 1) and, from (|46[), 


msa^ - {k - 1) = m [ Sa^ - — ] + I, 


m 


so that 


l-rk< 


m||rc||i 


\v + tcjli 

On the other hand, if > k/m then is in the interval [{k — l)/m, k/m) along with the 

complete interval k/m). Therefore, > k — msai^-i and, from 


k — msau-i > 1 + m 


k-l 


m 


- Sak-l > 1 + 


m\\w\\i 
||u + tclli 


so that again. 


1 - ’’fc < 


m\\w\\i 


\v + tcjli 

Since > 0 the above bounds on 1 — allow us to conclude that 


so that 


~^k) <m min 1, 


E[\rv-riv+w)\‘^] <4 


\v + tc 1 


llu + tclh 


min < 1 


m 


m\\w\\i 


’ ||u + will 

concluding the proof of (43). The upper bound (44) follows by an application of Lemma □ 


Though the bounds in Lemma are somewhat crude, they are sharp in one respect. One can 
find sequences of vectors achieving an error scaling as ||u + w||J'^^||w|| as m is increased. As a 
consequence, when A4 is an e perturbation of the identity, we can expect |||<hi(V^”^) — V^"*||| to be of 
order ^/e for this scheme, but we cannot guarantee that the error will remain controllable with m 
(e.g. as e decreases. This point demonstrates that careful attention to the compression 

scheme is required in cases in which A4 is a perturbation of the identity. In that context, even 
schemes inspired by well regarded resampling strategies may not be effective. 
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6 Numerical tests 


In this section we describe the application of the framework above to particular matrices arising 
in (i) the computation of the per-spin partition function of the 2D Ising model, (ii) the spec¬ 
tral gap of a diffusion process governing the evolution of a system of up to five, 2-dimensional 
particles (i.e., up to ten spatial dimensions), and (hi) a free energy landscape for that process. 
The corresponding numerical linear algebra problems are, respectively, (i) computing the domi¬ 
nant eigenvalue/eigenvector of matrices up to size 10^® x 10^®, (ii) computing the second largest 
eigenvalue/eigenvector of matrices up to size 10^® x 10^^, and (hi) solving a linear system involving 
exponentiation of matrices up to size 10^*^ x 10^°. Aside from sparsity, these matrices have no known 
readily exploitable structure for computations. 

The reader may wonder why we consider random compressions instead of simple thresholding, 
i.e., a compression rule in which, if aj is the index of the jth largest (in magnitude) of v, v^. is 
simply set to zero for all j > m (the resulting vector can be normalized to preserve £^-norm or not). 
In the rest of this paper we will refer to methods using such a compression rule as truncation-by-size 
(TbS) schemes. TbS schemes have been considered by many authors (see e.g. |26l EOj HSJ HS] ) and 
are a natural approach. Note however that the error (if the compression is not normalized). 



for the thresholding compression can be as large as 11 / 110011 ^^ 111(1 — ^-/?^) which only vanishes if m 
is increased faster than n. In contrast, the random compressions above can have vanishing error 
even when n is infinite. This observation is key to understanding the substantial reduction in error 
we find for our fast randomized scheme over TbS in numerical results presented in this section. 

In order for the FRI approach to yield significant performance improvements, one must use 
an efficient implementation of matrix by sparse vector multiplication. In the examples below we 
list the / j pairs for which the product KijVj is nonzero, then sort the products according to the 
i-index and finally, add the products with common i. This is a simple and sub-optimal solution. 
In particular, if the entries of the matrix K can be efficiently accessed column by column then the 
sorting step can be avoided. More details can be found in the example code available here [55]. 


6.1. A transfer matrix eigenproblem In this example we find the dominant eigenvalue A* 
of the transfer matrix K of the 2-dimensional £-spin Ising model. This eigenvalue is the per-spin 
partition function of the infinite spin Ising model, i.e., 

A^(T,R)= hm 


where aij G {—1,1} and the sum in the exponent is over pairs of indices on a square 2-dimensional, 


periodic lattice with i sites. The outer sum is over all 2 
cannot possibly compute it directly. The matrix K is 2 


possible values of a, and for larger £, one 
X 2^. For example, in the case £ = 3, 


K = 


a 
b-^ 


where 


2 ^ 


b-^ 

^-1 


6-1 

^-1 


(48) 
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We therefore also cannot hope to apply the power method (or its relatives) directly to K when (. 
is large. In our experiments we set T = 2.2, B = 0.01, and £ = 50 so that n = 2^ > 10^^, We 
apply both the FRI and TbS, 0(1) schemes to computing A* as well as to computing the sum of all 
components of the corresponding eigenvector, u* (normalized to have sum equal to 1), with index 
greater than or equal to 2^®, i.e., 

Knowledge of the partition function A* as a function of temperature T and field strength B 
allows one to determine useful quantities such as the average magnetization (sum of spin values) 
and to diagnose phase transitions [26] . Our choice to estimate A* and /* is motivated in part by the 
fact that these quantities can be approximated accurately by the corresponding values for smaller 
Ising systems. We will compare our results to those for the 24-spin Ising model which we can 
solve by standard power iteration. For an effective specialized method for this problem see [Hj. A 
simple, educational implementation of Fast Randomized Iteration applied to this problem can be 
found here [55] . 



Figure 1: Top. Trajectory averages of the approximation, A™, of the partition function for the 
50-spin Ising model with B = 0.01 and T = 2.2, with 95% confidence intervals® as computed 
by the FRI method with m = 2^ for A: = 20, 21, 22, 23, and 24. The best (highest m) estimate 
for A* is 2.584 a difference of roughly IB from the value for the 24-spin Ising model. Bottom. 
Corresponding trajectory averages of the approximation, T™, of the total weight of all components 
of u* with index greater than or equal to 2^® with 95% confidence intervals for the FRI method. 
The best (highest m) estimate for /* is 0.606, a difference of roughly 5i? from the value for the 
24-spin model. 


In Figure]^ we report the trajectory averages of the approximations A)” and generated by 
the FRI scheme (iteration (16) using Algorithm with m = 2^®, 2^^, 2^^, 2^^, and 2^^ and 10® 
total iterations. The best (highest m) approximation of A* is 2.584 and the best approximation of 
/* is 0.606. The results for the 24-spin Ising problem are A* ~ 2.596 and /* ~ 0.658 a difference 
of roughly IB and 5B from the respective approximations generated by the FRI method. In 
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Figure 2: Top. Trajectory of the approximation, AJ", of the partition function for the 50-spin Ising 
model, for the FRI method with m = 2^^. The approximate integrated autocorrelation time for 
A™ is 20.5 iterations. Bottom. Corresponding trajectory of as computed by the FRI method. 
The approximate integrated autocorrelation time for is 274 iterations. 


Figure we plot the corresponding trajectories of A™ and F™. These plots strongly suggest that 
the iteration equilibrates rapidly (relative to the total number of iterations). Indeed, we estimate 
the integrated autocorrelation time^of A™ and F™ to be 20.5 and 274 respectively. This in turn 
suggests that one could achieve a dramatic speedup by running many parallel and independent 
copies (replicas) of the simulation and averaging the resulting estimates of A* and /* though we 
have not taken advantage of this here. 

Figurej^reports the analogous trajectories (see Equation (49) below) of A™ and F/" as generated 
by iteration ( |16[ ) with the TbS scheme (iteration (16) using truncation-by-size) and the same values 
of m. The best (highest m) TbS approximation of A* is 2.545 and the best TbS approximation 
of /* is 0.014, a difference of more than 5F and 64F respectively. In Figure we plot the sums 
of the values of the approximation, V)™", of the dominant eigenvector of the Ising transfer matrix 
at t = 10^ over 256 intervals of equal size out of the 2^^ total indices. The top plot represents 
as generated by the FRI method and the middle plot represents as generated by the TbS 
approach. The TbS iteration has converged to a vector with nearly all of weight concentrated on 
very low indices. The bottom plot in Figure [^represents the dominant eigenvector for the 24-spin 
Ising transfer matrix. The qualitative agreement with the realization of V)™ as generated by the 
FRI method is much stronger than agreement with the result of the TbS method. 


Remark 12. In this problem we compute the dominant eigenvalue A* and a projection /* of the 
dominant eigenvector u* of the matrix K defined in equation (48) using the FRI in conjunction 


®According to the central limit theorem for Markov processes (assuming it holds), for large t the variance of the 
trajectory average of A)" should be jt where is the infinite t limit of the variance of At and r is the integrated 
autocorrelation time of Af". Roughly, it measures the number of iterations between independent Af*. 
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Figure 3: Top. Trajectory of the approximation, AJ", of the partition function for the 50-spin Ising 
model, for the TbS method with m = 2^^. The best (highest m) approximation is A* ~ 2.545, 
a difference of about 5B from the value for the 24-spin Ising model. Bottom. Corresponding 
trajectory of F/” as computed by the TbS method. The best (highest m) approximation is /* ~ 
0.014, a difference of almost 645 from the value for the 24-spin model. 


with the power method. Using the trajectory averages 

AT = \jZ^T and KT = \jZPr (49) 

S = 1 S=1 

to estimate A* and /* would seem strange had the iterates A™ and F™ been generated by the 
deterministic power method (we have not reported trajectory averages for the deterministic TbS 
approach). However, for finite m we do not expect A™ or F/" to converge to A* and /* as t increases. 
Rather we expect that the distribution of A™ and F™ will converge to some distribution roughly 
centered around A* and /* respectively. Though in our convergence results we have not addressed 
the ergodicity of the Markov process V)™, one would expect that reasonable functions of such 
as A™ and F/" satisfy a law of large numbers so that, for very large t, the trajectory averages A™ 
and F[^ differ from A* and /* only by a systematic error (i.e., they converge to the limit of the 
expectations of A* and F/” respectively). 


6.2. A PDE eigenproblem For given functions b{x) and cr{x) with values M” and M"' x W, the 
backwards Kolmogorov operator 

Lf = b^Df + ^ trace (cJcr^F^/) (50) 

is the generator of the diffusion process 

dX{t) = b{X{t)) dt + a{X{t)) dW{t) (51) 

where W(t) is an r-dimensional Brownian motion, Df is the vector of first order derivatives of /, 
and D^f is the matrix of its second order derivatives. The operator L governs the evolution of 
moments of X (t) in the sense that 


jB4f{X{t))] 


Lf{x) 
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Figure 4: Top. Sums of the values of the approximation, V^, of the dominant eigenvector of the 
50-spin Ising transfer matrix with B = 0.01 and T = 2.2, at t = 10^ over 256 intervals of equal size 
out of the 2^*^ total indices for the FRI method with m = 2^^. Middle. Corresponding sums for the 
TbS method. Bottom. Exact eigenvector for the 20-spin Ising model for qualitative comparison. 


(the subscript on the expectation indicates that Xq = x). Note that constant functions are in 
the kernel of L. The non-trivial eigenfunctions of L all correspond to negative eigenvalues. The 
magnitude of the greatest of these negative eigenvalues is the spectral gap and characterizes the 
rate of convergence of expectations such as ¥!ix[f {X to their equilibrium (large t) values. 

In this subsection we consider estimation of the largest negative eigenvalue of L for b = —DV 
with 


V{x 


( 1 ) 


i 


= - y ^ cos(27rx^‘ 
i=i 




cos(27rx 


OR 

2 ) 


+ 2 ^ ^ cos(7r(a:^'^ — )) cos(7r(x2 — ^ 2 )) 

j=i k=j+i 

for x^^^ = {x^i\x 2 '^) G [—1,1) X [—1,1). The diffusion coefficient, cr{x), is hxed as \/2. The function 
V is the potential energy for a periodic system of £, 2D-particles, each subject to both an external 
force as well as a nonlinear spring coupling the particles together. Equation (51) is a model of the 
dynamics of that system of particles (in a high friction limit). 

The equation Lg^ = is first projected onto a Fourier spectral basis, i.e., we assume that 


g*{x) = 


X 




v{a) e®' 


where a = ..., with arid the symbol is used to indicate that both 

and aj ® are integers with magnitude less than N. 

Suppose that L is the corresponding spectral projection of L (which, in this case, is real). The 
matrix L can be decomposed into a sum of diagonal (corresponding to the second order term in L) 
and non-diagonal term (corresponding to first order term in L), i.e., 

L = A + D. 


®We use the term “confidence intervals” loosely. Our results clearly indicate that m is not so large that systematic 
errors and deviations from asymptotic normality can be safely ignored. 


32 










In this problem we are trying to find not the eigenvalue of L of largest magnitude but the largest 
(they are real) eigenvalue and a transformation is required. As we mentioned in Section 3.2 this 
can be accomplished by applying power iteration to a matrix obtained from a discrete-in-time 
approximation of the ODE 

‘‘ i 


i.e., by exponentiating the matrix tL for very large t. For example, for sufficiently small e > 0, the 
eigenvalue, /r, of largest magnitude, of the matrix 

K = {I + eL)ey^ (52) 

is, to within an error of oder e^, 1 -|- eA* where A* is the eigenvalue of L of largest real part (in 
our case the eigenvalues are real and non-positive). We will apply our iteration schemes to K. By 
fixing ut(0) = 0 we can guarantee that the approximate solutions all have vanishing integral over 
[—1,1)^^, ensuring that the iteration converges to an approximation of the desired eigenvector/value 
pair (instead of to v{0) = 1, v{d) = 0 if a / 0). 

Remark 13. In this problem, rather than estimating the dominant eigenvector of K, our goal 
is estimate the second largest (in magnitude) eigenvalue of K. Given that we know the largest 
eigenvalue of /C is 1 with an eigenvector that has value one in the component corresponding to 
5 = 0 and zeros in all other components, we can therefore exactly orthogonalize the iterates V)™' 
with respect to the dominant eigenvalue at each iteration (by fixing I/)”*(C)) = 0). We may view this 
as using FRI in conjunction with a simple case of orthogonal iteration. 


We compare the FRI and TbS approaches with = 51 for the four- and five-particle systems 
{i = 4,5). The corresponding total count of real numbers needed to represent the solution in the 
five-particle case is 101^® ~ 10^^ so only the 0(1) scheme is reasonable. For h we choose a value 
of 10“^. Our potential V is chosen so that the resulting matrix L (and therefore also K) is sparse 
and its entries are computed by hand. For a more complicated V, the entries of L might have to 
be computed numerically on-the-fly or might not be computable at all. Our ability to efficiently 
compute the entries of L will be strongly effected by the choice of basis. For example, if we use a 
finite difference approximation of L then the entries of L can be computed easily. On the other hand, 
if the solution is reasonably smooth, the finite difference approximation will converge much more 
slowly than an approximation (like the spectral approximation) that more directly incorporates 
properties of the solution (regularity in this case). 

Figure plots the trajectory averages over 10^ iterations for A™ in the i = A case generated 
by the FRI method (iteration (16) using Algorithm along with corresponding trajectories of A™ 
as generated by the TbS approach (iteration ( [I^ using truncation-by-size). We present results 
for both methods with m = 1, 2, 3, and 4 x 10^ Observe that the results from the FRI method 
appear to have converged on one another while the results generated by the TbS approach show no 
signs of convergence. The best (highest m) estimate of the eigenvalue generated by FRI is —2.31 
and the best estimate generated by TbS is —2.49. Figure plots the trajectory of A™ in the five 
particle (£ = 5) case as generated by the FRI method with m = 10® along with its trajectory 
average (neglecting the first 500 iterations) of about —1.3. Note that A™ appears to reach its 
statistical equilibrium rapidly relative to the 2 x 10^ total iterations. Again, the rapid equilibration 
suggests that statistical error could be removed by averaging over many shorter trajectories evolved 
in parallel. 


6.3. A PDF steady state problem The adjoint L* of the operator defined in (50) with respect 


to the standard inner product is called the Fokker-Planck operator. The operator determines the 
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Figure 5: Trajectory averages of the approximation, A™, of the largest negative eigenvalue of a 
backwards Kolmogorov operator for a four 2D-particle (8-dimensional) system, with 95% confidence 
intervals for the FRI method with m = 1, 2, 3, and 4 x 10'^. The operator is discretized using a 
Fourier basis with 101 modes per dimension for a total of more than 10^® basis elements (half that 
after taking advantage of the fact that the desired eigenvector is real). The step-size parameter h 
is set to 10“^. Also on this graph, trajectories of A™ for the TbS method for the same values of m. 


evolution of the density of the process X{t) defined in (51) in the sense that if /r is that density 
then 

dtn = L*fi. 

An element in the kernel of L* (a steady state solution of the Fokker-Planck equation) is a density 
left invariant by X[t). 

For the choice of b and a given in the previous subsection, the steady state solutions are easily 
seen to be constant multiples of the function 

^ J (,-V(x) ■ 

In most applications, the goal is to compute averages of observables with respect to /r*. For 
example, one might hope to find (up to an additive constant) the effective potential (or free-energy) 
experienced by particle 1, 

Fi(x^^^) = — log J ^^{x) ■ ■ ■ dx^^\ 

For that purpose, explicit knowledge of /r* is of little value since one cannot hope to compute 
integrals with respect to a function of so many variables (up to 101^ in our tests). One instead 
hopes to find a more digestible expression for /r*. Notice that if a Fourier expansion 

was available then we could compute 
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Figure 6: Trajectory of the approximation, (solid blue line), of the largest negative eigenvalue 
of a backwards Kolmogorov operator for the five-particle system as computed by the FRI method 
with m = 10® over 2 x 10^ iterations. The total dimension of the discretized system is more than 
10^®. The average value of A™ (ignoring the first 500 iterations) is —1.3 and is shown by a solid 
black line. 



As in the previous section we discretize the Fokker-Planck operator in a Fourier basis resulting 
in a finite-dimensional linear root finding problem 


(K - I) n* = 0 


where K is now defined just as in (52) but with A = + D. We choose to normalize the solution 

so that n((l) = 1 which then results in a linear system 

[K - I) = r 


where r(a) = — AT-g, K has the row and column corresponding to the index 0 removed, and n* has 
the component corresponding to index 0 removed. 


Remark 14. Note that the linear system {K — I)v^ = r is solved for u* here using FRI in conjunction 
with Jacobi iteration. With the normalization nt(()) = 1, this is equivalent to using the power 
iteration to hnd the eigenvector corresponding to the largest eigenvalue of K (which is 1). Recalling 
that here K ^ I + e(L^ -I- D), observe that we are (when e is small) approximately computing 
lim^^oo exp(At)no with A = which, since the largest eigenvalue of A is 0, is the desired 

eigenvector. We repeat that though we know that the dominant eigenvalue of K and we have a 
formula for /i*, the dominant eigenvector of L*, our goal is to compute projections of that cannot 
be computed by deterministic means. 

In Figure]^ we present approximations of the function Fi generated by the 0(1) scheme 

vi!i, = ^T{Kvn, 

F-i(aW) = (KFn («('), 0,...,0) 

= (1 - £t)Ft(aW) + 
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for all G with et = {t + 1)“^ and where the independent mappings are generated 
according to Algorithm The single particle free-energ30 as generated by the FRI approach is 
plotted for £ = 2, 3, 4, and 5 with m = 10, 200, 10^, and 10® respectively. In the two, three, and 
four particle simulations we use 10® iterations. Again we choose A" = 51 and h = 10“^. The high 
cost per iteration in the five particle case restricts our simulations to 2 x 10^ iterations. In the 
four particle case, for which we have validated the FRI solution by simulations with higher values 
of m {m = 4 X 10^), the free energy profile produced by the TbS approach differs from the FRI 
result by as much as 100%. We take slight advantage of the particle exchange symmetry and, at 
each iteration, replace {KVl^) 0,..., 0) in the above equation for by the average of all £ 
components of the form (KV^) (0,..., 0,..., 0). Note that in the expansion of /r*, we know 

that v{d) is unchanged when we swap the indices and corresponding to any two particles. 
This fact could be leveraged to greatly reduce the number of basis functions required to accurately 
represent the solution. We have not exploited this possibility. 

Though it is not accurate, the TbS scheme is substantially more stable on this problem. We 
assume that the relative stability of the TbS scheme is a manifestation of the fact that TbS is not 
actually representing the high wave number modes that are responsible for stability constraints. 
Nonetheless, simulating higher dimensional systems would require modifications in our approach. 
In particular it might be necessary to identify a small number of components of the solution that 
should always be resolved (never set to zero). For example, for this problem one might choose to 
resolve some number of basis functions for each particle that are independent of the positions of 
the other particles. 

7 Discussion 

We have introduced a family of fast, randomized iteration schemes for eigenproblems, linear systems, 
and matrix exponentiation. Traditional iterative methods for numerical linear algebra were created 
in part to deal with instances where the coefficient matrix A (of size 0(n^)) is too big to store but 
where the operation x i— Ax can nonetheless be carried out. The 0(1) iterative methods in this 
paper are intended for instances where even the solution vector x (of size 0(n)) is too big to store 
but where the ultimate goal is to compute Fx where F may be dense but has many fewer rows 
than columns. In addition to the contexts addressed explicitly above, such problems have become 
increasingly common in other areas as well; for example, in the analysis of large data sets and in 
the solution of large-scale Lyapunov and Sylvester equations misiiES] arising in control theory. 

A completely deterministic approach to iterative problems related to the methods proposed in 
this article is the simple thresholding by size (TbS) in which, at each iteration, the smallest entries 
in the approximation are set to zero. An adaptive version of the TbS approach has recently been 
advocated for a wide range of applications (see [5011121113]) including variants of those considered 
in this paper. Like TbS our randomized schemes also rely on the enforcement of sparsity and also 
tend to set small entries in the approximate solution to zero. While TbS schemes can be effective 
on some problems with sparse solutions its error, in general, will be strongly dependent on system 
size. 

The core concept behind the schemes is the notion that, by randomly setting entries in vectors to 
zero, while maintaining a statistical consistency property, we can dramatically reduce the cost and 
storage of standard iterative schemes. One can view our FRI schemes as an attempt to blur the line 
separating Markov chain Monte Carlo (MCMC), which is effective in extremely high-dimensional 

^Note that we only approximate up to the additive constant — log / In fact, the free energy is typically 

only defined up to that constant because it is not uniquely specified (one can add a constant to V without changing 
M*. 
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Particle 1 free-energy 



Figure 7: Free energy landscape experienced by a single particle for the (clockwise from top left) 
two, three, four, and five 2D-particle systems. The surfaces were generated using the FRI method 
with m = 10, 200, 10^, and 10® respectively and h = 10“^. The two-, three-, and four-particle sim¬ 
ulations were run for 10® iterations. The five particle simulation was substantially more expensive 
per iteration and was run for only 2 x 10® iterations. The number of Fourier modes used to represent 
the solution in all simulations is 101 per dimension for a total of more than 10®, 10®^, 10®®, and 10^® 
basis functions (half that after taking advantage of the fact that the solution is real). As expected, 
the free energy basins deepen as the number of particles grows. In the four particle case (for which 
we have high confidence in the estimate produced by FRI) the error from the TbS approach (the 
results of which we do not plot) is roughly 100% at peaks of the free energy landscape. 


settings but is limited to a relatively narrow class of problems and often does not allow the user to 
take full advantage of known properties of the solution (e.g. smoothness or symmetry properties as 
in 0 ), and traditional deterministic schemes, which are effective on a very general set of relatively 
low dimensional problems. As for MCMC approaches, if one settles for computing low dimensional 
projections of the full solution then not every element of the state space need be visited and effective 
FRI schemes with per iteration cost and storage requirements independent of system size can be 
derived (as for MCMC the validity of this statement depends on the particular sequence of problems 
considered). Also as for MCMC we expect that, when deterministic alternatives are available, they 
will outperform our randomized schemes. For matrices of the size considered in all of our numerical 
tests, deterministic alternatives are not available. 

Experience with diffusion Monte Carlo in the context of quantum Monte Carlo simulations 
suggests that our randomized schemes will be most useful if applied after considerable effort has 
been expended on finding changes of variables that either make the desired solution as sparse as 
possible (reducing both bias and variance) or reduce bias by some other means. In many cases this 
will mean applying our randomized schemes only after one has obtained an estimate of the solution 
by some deterministic method applied to a reduced dimensional version of the target problem. 


37 












Acknowledgments 

JQW would like to thank Eric Gances, Tony Lelievre, and Matthias Rousset, for their hospitality 
and helpful discussions during a visit to ENPC that coincided with the early stages of this work. He 
would also like to thank Omiros Papaspiliopoulos for many helpful conversations that informed the 
compression strategies adopted in this article. Both authors would like to thank Alexandre Chorin, 
Petros Drineas, Risi Kondor, Jianfeng Lu, Omiros Papaspiliopoulos, and Panos Stinis, who made 
comments that strongly affected this article’s structure and content. LHL’s work is generously 
supported by NSP IIS-1546413 and DMS-1209136. JQW’s effort on this project was supported by 
the Advance Scientific Computing Research program within the DOE Office of Science through 
award de-sc0014205 as well as through a contract from Argonne, a U.S. Department of Energy 
Office of Science laboratory. 

References 

[1] V.N. Alexandrov and S. Lakka. Comparison of three monte carlo methods for matrix inversion. 
In Luc Bouge, Pierre Eraigniaud, Anne Mignotte, and Yves Robert, editors, Euro-Par’96 
Parallel Processing, volume 1124 of Lecture Notes in Computer Science, pages 72-80. Springer 
Berlin Heidelberg, 1996. 

[2] J. Anderson. A random-walk simulation of the Schrodinger equation: Hj^. J. Chem. Phys., 
63(4):1499-1503, 1975. 

[3] Roi Baer, Daniel Neuhauser, and Eran Rabani. Self-averaging stochastic kohn-sham density- 
functional theory. Phys. Rev. Lett., 111:106402, Sep 2013. 

[4] P. Benner, J.R. Li, and T. Penzl. Numerical solution of large-scale Lyapunov equations, 
Riccati equations, and linear-quadratic optimal control problems. Numerical Linear Algebra 
and Applications, 15(9):755-777, 2008. 

[5] George H. Booth and Ali Alavi. Approaching chemical accuracy using full configuration in¬ 
teraction quantum monte carlo: a study of ionization potentials. J. Chem. Phys., 132:174104, 
2010. 

[6] George H. Booth, D. Gleland, Alex J. W. Thom, and Ali Alavi. Breaking the carbon dimer: 
the challenges of multiple bond dissociation with full configuration interaction quantum Monte 
Garlo methods. J. Chem. Phys., 135:084104, 2011. 

[7] George H. Booth, Andreas Griineis, Georg Kresse, and Ali Alavi. Towards an exact description 
of electronic wavefunctions in real solids. Nature, 493(7432):365-370, 01 2013. 

[8] George H. Booth, Alex J. W. Thom, and Ali Alavi. Eermion monte carlo without fixed nodes: 
A game of life, death, and annihilation in slater determinant space. The Journal of Chemical 
Physics, 131(5), 2009. 

[9] Leon Bottou. Stochastic learning. In Olivier Bousquet and Ulrike von Luxburg, editors. 
Advanced Lectures on Machine Learning, Lecture Notes in Artificial Intelligence, LNAI 3176, 
pages 146-168. Springer Verlag, Berlin, 2004. 

[10] N. Bou-Rabee and E. Vanden-Eijnden. Gontinuous-time random walks for the numerical 
solution of stochastic differential equations. arXiv:1502.05034 [math.PR], 2015. 


38 



[11] D. Ceperley and B. Alder. Ground state of electron gas by a stochastic method. Phys. Rev. 
Lett, 45(7):566-569, 1980. 

[12] Alexandre Chorin. Random choice solution of hyperbolic systems. Journal of Computational 
Physics, 22:517-536, 1976. 

[13] D. Cleland, George H. Booth, and Ali Alavi. Survival of the fittest: accelerating convergence 
in full configuration-interaction quantum Monte Garlo. J. Chem. Phys., 132:041103, 2010. 

[14] D. Gleland, George H. Booth, and Ali Alavi. A study of electron affinities using the initiator 
approach to full configuration interaction quantum Monte Garlo. J. Chem. Phys., 134:024112, 
2011. 

[15] E. Coakley, V. Rokhlin, and M. Tygert. A fast randomized algorithm for orthogonal projection. 
SIAM Journal on Scientific Computing, 33(2):849-868, 2011. 

[16] N. de Freitas, A. Doucet, and N. Gordon (Eds). Sequential Monte Carlo Methods in Practice. 
Springer, 2005. 

[17] Pierre Del Moral. Feynman-Kac formulae. Probability and its Applications (New York). 
Springer-Verlag, New York, 2004. Genealogical and interacting particle systems with applica¬ 
tions. 

[18] James W. Demmel. Applied Numerical Linear Algebra. Society for Industrial and Applied 
Mathematics, 1997. 

[19] P. Drineas, R. Kannan, and M. Mahoney. Fast Monte Carlo algorithms for matrices i: Ap¬ 
proximating matrix multiplication. SIAM Journal on Computing, 36(1):132-157, 2006. 

[20] P. Drineas, R. Kannan, and M. Mahoney. Fast Monte Carlo algorithms for matrices ii: Com¬ 
puting a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158-183, 
2006. 

[21] P. Drineas, R. Kannan, and M. Mahoney. Fast Monte Carlo algorithms for matrices hi: 
Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 
36(l):184-206, 2006. 

[22] Weinan E, Bjorn Engquist, Xiantao Li, Weiqing Ren, and Eric Vanden-Eijnden. The heterog- 
nous multiscale methods: A review. Communications in Computational Physics, 2(3):367-450, 
06 2007. 

[23] Sylvester Eriksson-Bique, Mary Solbrig, Michael Stefanelli, Sarah Warkentin, Ralph Abbey, 
and Use C. F. Ipsen. Importance sampling for a Monte Carlo matrix multiplication algorithm, 
with application to information retrieval. SIAM J. Sci. Comput., 33(4):1689-1706, July 2011. 

[24] W. Foulkes, L. Mitas, R. Needs, and G. Rajagopal. Quantum Monte Carlo simulations of 
solids. Rev. Mod. Phys., 73(l):33-79, 2001. 

[25] S. Friedland and L.-H. Lim. The computational complexity of duality, http://www.stat. 
uchicago.edu/~lekheng/work/dual.pdf , 2015. 

[26] Norman H. Fuchs. Approximate solutions for large transfer matrix problems. Journal of 
Computational Physics, 83(1):201 - 211, 1989. 


39 



[27] J. Goodman and N. Madras. Random-walk interpretations of classical iteration methods. 
Linear Algebra Appl, 216;61--79, 1995. 

[28] N. Gordon, D. Salmond, and A. Smith. Novel approach to nonlinear non-Gaussian Bayesian 
state estimation. lEE Proceedings F, 51:107 - 113, 1993. 

[29] R. Grimm and R. Storer. Monte-Carlo solution of Schrodinger’s equation. J. Comp. Phys., 
7(1):134-156, 1971. 

[30] Martin Hairer and Jonathan Weare. Improved diffusion Monte Carlo. Commun. Pure Appl. 
Math., 67:1995—2021, 2014. 

[31] J.M. Hammersley and K.W. Morton. Poor man’s Monte Carlo. J. R. Stat. Soc. B, 16(l):23-38, 
1954. 

[32] M.F. Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian 
smoothing splines. Communications in Statistics - Simulation and Computation, 18(3): 1059- 
1076, 1989. 

[33] M. Kalos. Monte Carlo calculations of the ground state of three- and four-body nuclei. Phys. 
Rev., 128(4):1791-1795, 1962. 

[34] G. Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. 
Journal of Computational and Graphical Statistics, 5:1-25, 1996. 

[35] J. Kolorenc and L. Mitas. Applications of quantum Monte Carlo methods in condensed sys¬ 
tems. Rep. Prog. Phys., 74:1-28, 2011. 

[36] H. Kushner and G. Yin. Stochastic Approximation and Recursive Algorithms and Applications, 
volume 35 of Stochastic Modelling and Applied Probability. Springer, 2nd edition, 2003. 

[37] Edo Liberty, Franco Woolfe, Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert. 
Randomized algorithms for the low-rank approximation of matrices. Proceedings of the Na¬ 
tional Academy of Sciences, 104(51):20167-20172, 2007. 

[38] P. Lopez Rfos, A. Ma, N. D. Drummond, M. D. Towler, and R. J. Needs. Inhomogeneous 
backflow transformations in quantum monte carlo calculations. Phys. Rev. E, 74:066701, Dec 
2006. 

[39] Per-Gunnar Martinsson, Vladimir Rokhlin, and Mark Tygert. A randomized algorithm for 
the decomposition of matrices. Applied and Computational Harmonic Analysis, 30(1) :47 - 68, 
2011. 

[40] Eric Moulines and Francis R. Bach. Non-asymptotic analysis of stochastic approximation 
algorithms for machine learning. In J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, 
and K.Q. Weinberger, editors. Advances in Neural Information Processing Systems 24, pages 
451-459. Curran Associates, Inc., 2011. 

[41] M. P. Nightingale and H. W. J. Blote. Gap of the linear spin-1 heisenberg antiferromagnet: A 
monte carlo calculation. Phys. Rev. B, 33:659-661, Jan 1986. 

[42] Vidvuds Ozolins, Rongjie Lai, Russel Caflisch, and Stanley Osher. Compressed modes for 
variational problems in mathematics and physics. Proceedings of the National Academy of 
Sciences, 110(46):18368-18373, 2013. 


40 



[43] Vidvuds Ozolins, Rongjie Lai, Russel Caflisch, and Stanley Oslier. Compressed plane waves 
yield a compactly supported multiresolution basis for the laplace operator. Proceedings of the 
National Academy of Sciences, 111(5):1691-1696, 2014. 

[44] B. Parlett and Wee-Liang Heng. The method of minimal representations in 2D Ising model 
calculations. Journal of Computational Physics, 114:257-264, 1994. 

[45] G.A. Pavliotis and A.M. Stuart. Multiscale Methods: Averaging and Homogenization, vol¬ 
ume 53 of Texts in Applied Mathematics. Springer, 2008. 

[46] Vladimir Rokhlin, Arthur Szlam, and Mark Tygert. A randomized algorithm for principal 
component analysis. SIAM J. Matrix Anal. AppL, 31(3):1100-1124, August 2009. 

[47] Vladimir Rokhlin and Mark Tygert. A fast randomized algorithm for overdetermined linear 
least-squares regression. Proceedings of the National Academy of Sciences, 105(36):13212- 
13217, 2008. 

[48] M.N. Rosenbluth and A.W. Rosenbluth. Monte Carlo calculation of the average extension of 
molecular chains. J. Chem. Phys., 23(2):356-359, 1955. 

[49] Mathias Rousset. On the control of an interacting particle approximation of schrodinger ground 
states. SIAM J. Math. Anal, 38(3):824-844, 2006. 

[50] Hayden Schaeffer, Russel Caflisch, Cory D. Hauck, and Stanley Osher. Sparse dynamics for 
partial differential equations. Proceedings of the National Academy of Sciences, 110(17):6634- 
6639, 2013. 

[51] J.J. Sheperd, George H. Booth, A. Griineis, and Ali Alavi. Full configuration interaction 
perspective on the homogeneous electron gas. Phys. Rev. B, 85, 2012. 

[52] V. Simoncini. A new iterative method for solving large-scale Lyapunov matrix equations. 
SIAM J. Sci. Comput., 29(3):1268-1288, 2007. 

[53] V. Simoncini, D.B. Szyld, and M. Monsalve. On two numerical methods for the solution of 
large-scale algebraic Riccati equations. IMA J. Numer. Anal., 34(3):904-920, 2014. 

[54] G. W. Stewart. Matrix Algorithms II: Eigensystems. SIAM, 2001. 

[55] Jonathan Weare. A simple example in C-|—I- of FRI applied to computing the dominant 
eigenvalue of a matrix, http: //dx.doi. org/10.5281/zenodo.31208, 2015. 

[56] Franco Woolfe, Edo Liberty, Vladimir Rokhlin, and Mark Tygert. A fast randomized algorithm 
for the approximation of matrices. Applied and Computational Harmonic Analysis, 25(3):335 
- 366, 2008. 


41 


